# Market Mix Modelling: Budget Optimisation

ElecKart is an e-commerce firm based in Ontario, Canada, specialising in electronic products. Over the last year, it spent a significant amount of money on marketing and occasionally offered big-ticket promotions (similar to the Big Billion Day). It is about to create a marketing budget for the next year, which includes spending on commercials, online campaigns, and pricing & promotion strategies. The CFO feels that the money spent on marketing over the last 12 months was not sufficiently impactful, and that they can either cut the budget or reallocate it optimally across marketing levers to improve the revenue response.
Imagine that you are part of the marketing team working on budget optimisation. You need to develop a market mix model to observe the actual impact of different marketing variables over the last year. Using your understanding of the model, you have to recommend the optimal budget allocation for different marketing levers for the next year.
The objective of this project is to create a market mix model for three product sub-categories - Camera accessory, Gaming accessory and Home Audio - to observe the actual impact of various marketing variables over one year (July 2015 to June 2016) and recommend the optimal budget allocation for different marketing levers for the next year.
# import libraries
import pandas as pd
import numpy as np
# For Visualisation
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import matplotlib
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
# Pandas Settings
# pd.options.display.float_format = '{:.1f}'.format
pd.set_option('display.max_rows', 40000)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
# Miscellaneous imports
from datetime import datetime
from scipy.stats import norm
import re
main_df = pd.read_csv('ConsumerElectronics.csv')
main_df.head()
SKU => Stock Keeping Unit
deliverybdays => days taken to dispatch the item from the warehouse for shipping
deliverycdays => days taken to deliver the item to the customer
product_list = pd.read_excel('Media data and other information.xlsx', sheet_name='Product List')
product_list.columns.values[1]='product_analytic_vertical'
product_list.drop(product_list.columns[0], axis=1, inplace = True)
product_list.head()
media_investment = pd.read_excel('Media data and other information.xlsx', sheet_name='Media Investment', skiprows=2)
media_investment.drop(media_investment.columns[0], axis=1, inplace = True)
media_investment.head()
sale_calendar = pd.read_excel('Media data and other information.xlsx', sheet_name='Special Sale Calendar', \
skiprows=0, skipfooter=2)
sale_calendar.drop(sale_calendar.columns[0], axis=1, inplace = True)
sale_calendar.iloc[1:6, 0] = sale_calendar.iloc[0, 0]
sale_calendar.iloc[7:, 0] = sale_calendar.iloc[6, 0]
sale_calendar
net_promoter_score = pd.read_excel('Media data and other information.xlsx', sheet_name='Monthly NPS Score', \
skiprows=0)
net_promoter_score.columns.values[0]='score'
net_promoter_score
# pip install --pre python-docx
import pandas as pd
import io
import csv
from docx import Document
def read_docx_tables(filename, tab_id=None, **kwargs):
"""
parse table(s) from a Word Document (.docx) into Pandas DataFrame(s)
Parameters:
filename: file name of a Word Document
tab_id: parse a single table with the index: [tab_id] (counting from 0).
When [None] - return a list of DataFrames (parse all tables)
kwargs: arguments to pass to `pd.read_csv()` function
Return: a single DataFrame if tab_id != None or a list of DataFrames otherwise
"""
def read_docx_tab(tab, **kwargs):
vf = io.StringIO()
writer = csv.writer(vf)
for row in tab.rows:
writer.writerow(cell.text for cell in row.cells)
vf.seek(0)
return pd.read_csv(vf, **kwargs)
doc = Document(filename)
if tab_id is None:
return [read_docx_tab(tab, **kwargs) for tab in doc.tables]
else:
try:
return read_docx_tab(doc.tables[tab_id], **kwargs)
except IndexError:
print('Error: specified [tab_id]: {} does not exist.'.format(tab_id))
raise
dfs = read_docx_tables('Product Details.docx')
dfs[0]
# Dropping the extra row(s)/column(s)
product_details = pd.DataFrame(dfs[0])
product_details.drop(product_details.columns[0], axis=1, inplace = True)
product_details.head()
main_df.head()
main_df.dtypes
# String to datetime
main_df['order_date'] = pd.to_datetime(main_df['order_date'], format='%Y-%m-%d %H:%M:%S')
# Int to object
main_df[['order_id','order_item_id']] = main_df[['order_id','order_item_id']].astype(object)
# Int to string
main_df[['Year','Month']] = main_df[['Year','Month']].astype(str)
main_df['deliverybdays'] = pd.to_numeric(main_df['deliverybdays'], errors='coerce')
main_df['deliverybdays'].fillna(value=0, inplace=True)
main_df['deliverycdays'] = pd.to_numeric(main_df['deliverycdays'], errors='coerce')
main_df['deliverycdays'].fillna(value=0, inplace=True)
# String to numeric
main_df['gmv'] = pd.to_numeric(main_df['gmv'], errors='coerce')
main_df.dtypes
# Unique value frequencies
unique_values = pd.DataFrame(main_df.apply(lambda x: len(x.value_counts(dropna=False)), axis=0), columns=['Unique Value Count']).sort_values(by='Unique Value Count', ascending=True)
unique_values['dtype'] = pd.DataFrame(main_df.dtypes)
unique_values
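As a side note, `Series.nunique(dropna=False)` gives the same count as the length of `value_counts(dropna=False)`, which can make this kind of cardinality summary more concise. A small sketch on toy data:

```python
import pandas as pd
import numpy as np

s = pd.Series([1, 1, 2, np.nan])
# Both treat NaN as its own value when dropna=False
print(s.nunique(dropna=False), len(s.value_counts(dropna=False)))  # 3 3
```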
initial_shape = main_df.shape
initial_shape
# Instances where GMV values are greater than MRP * units which is incorrect
print(main_df.loc[main_df['product_mrp'] * main_df['units'] < main_df['gmv']].shape[0])
print(round(100*(main_df.loc[main_df['product_mrp'] * main_df['units'] < main_df['gmv']].shape[0] / main_df.shape[0]), 2))
There are 38569 records (2.34%) in the dataframe where the GMV value is greater than the MRP * units.
### gmv/units
# Update product_mrp based on another column (the per-unit price gmv/units)
main_df.product_mrp = np.where(main_df['product_mrp'] < main_df['gmv'] / main_df['units'], \
main_df['gmv'] / main_df['units'], main_df['product_mrp'])
main_df.shape
Reinspecting instances where GMV value is greater than the MRP * units.
# Instances where GMV values are greater than MRP * units which is incorrect
print(main_df.loc[main_df['product_mrp'] * main_df['units'] < main_df['gmv']].shape[0])
print(round(100*(main_df.loc[main_df['product_mrp'] * main_df['units'] < main_df['gmv']].shape[0] / main_df.shape[0]), 2))
All erroneous instances have been corrected (product_mrp was raised to gmv/units rather than dropping the rows).
### gmv < 0
print(main_df.loc[main_df['gmv'] < 0].shape[0])
print(round(100*(main_df.loc[main_df['gmv'] < 0].shape[0]/main_df.shape[0]), 2))
### product_mrp < 0
print(main_df.loc[main_df['product_mrp'] < 0].shape[0])
print(round(100*(main_df.loc[main_df['product_mrp'] < 0].shape[0]/main_df.shape[0]), 2))
### units <= 0
print(main_df.loc[main_df['units'] <= 0].shape[0])
print(round(100*(main_df.loc[main_df['units'] <= 0].shape[0]/main_df.shape[0]), 2))
No erroneous rows left in MRP, GMV or Units columns
### deliverybdays & deliverycdays
print(main_df.loc[main_df['deliverybdays'] < 0].shape[0])
print(round(100*(main_df.loc[main_df['deliverybdays'] < 0].shape[0]/main_df.shape[0]),4))
print(main_df.loc[main_df['deliverycdays'] < 0].shape[0])
print(round(100*(main_df.loc[main_df['deliverycdays'] < 0].shape[0]/main_df.shape[0]),4))
There are 38 records (0.0023%) with negative values for deliverybdays and 39 records (0.0024%) with negative values for deliverycdays. We will drop these rows, since the dispatch delay from the warehouse and the delivery time to the customer cannot be negative.
main_df = main_df.loc[(main_df['deliverybdays'] >= 0) & (main_df['deliverycdays'] >= 0)]
main_df.reset_index(drop=True, inplace=True)
main_df.shape
### product_procurement_sla
print(main_df.loc[main_df['product_procurement_sla'] < 0].shape[0])
print(round(100*(main_df.loc[main_df['product_procurement_sla'] < 0].shape[0]/main_df.shape[0]),4))
There are 75986 records (4.61%) with negative values for product_procurement_sla. We will drop these rows, since the time typically taken to procure a product cannot be negative.
main_df = main_df.loc[(main_df['product_procurement_sla'] >= 0)]
main_df.reset_index(drop=True, inplace=True)
main_df.shape
main_df[['product_procurement_sla']].describe().T
# Slightly alter the figure size to make it more horizontal.
plt.figure(figsize=(6, 4), dpi=100, facecolor='w', edgecolor='k', frameon='True')
sns.set_style("whitegrid") # white/whitegrid/dark/ticks
sns.set_context("paper") # talk/poster
main_df.product_procurement_sla.hist()
# Automatically adjust subplot params so that the subplots fit into the figure area.
plt.tight_layout()
# display the plot
plt.show()
main_df.product_procurement_sla.value_counts()
print(main_df.loc[main_df['product_procurement_sla'] >= 1000].shape[0])
print(round(100*(main_df.loc[main_df['product_procurement_sla'] >= 1000].shape[0]/main_df.shape[0]),4))
There are 4745 records (0.3%) with unusually large values of product_procurement_sla. We will drop these rows, since a procurement SLA of 1000 days or more is implausible.
# Retaining rows where the sla is less than 1000
main_df = main_df.loc[(main_df['product_procurement_sla'] < 1000)]
main_df.head()
main_df.shape
main_df.info()
cat_cols = [cname for cname in main_df.columns if main_df[cname].dtype == "object"]
cat_cols
# Filtering the string columns
cat_cols = ['fsn_id',
's1_fact.order_payment_type',
'product_analytic_super_category',
'product_analytic_category',
'product_analytic_sub_category',
'product_analytic_vertical']
for col in cat_cols:
main_df[col] = main_df[col].str.lower()
cat_cols = [cname for cname in main_df.columns if main_df[cname].dtype == "object"]
main_df[cat_cols].describe()
# Checking only those columns whose cardinality is less
selected_cat_cols = ['s1_fact.order_payment_type', 'product_analytic_super_category', 'product_analytic_category', 'product_analytic_sub_category', 'product_analytic_vertical']
for col in selected_cat_cols:
print('\n################################')
print('Unique values of ' + str(col))
print('################################')
print(pd.Series(main_df[col].unique()).sort_values(ascending=False))
There are no duplicate category values caused by spelling variations.
main_df.shape
print(main_df.duplicated().value_counts()[1])
print(round(100*(main_df.duplicated().value_counts()[1]/main_df.shape[0]),4))
main_df.drop_duplicates(keep='first', inplace=True)
main_df.shape
main_df.duplicated().value_counts()
No more duplicates
# Checking for total count and percentage of null values in all columns of the dataframe.
total = pd.DataFrame(main_df.isnull().sum().sort_values(ascending=False), columns=['Total'])
percentage = pd.DataFrame(round(100*(main_df.isnull().sum()/main_df.shape[0]),2).sort_values(ascending=False)\
,columns=['Percentage'])
pd.concat([total, percentage], axis = 1).head()
There are a few null values in the gmv column.
Several columns also contain whitespace-only entries.
We first convert these whitespace entries to NaN and then treat them accordingly.
main_df.replace(' ', np.nan, inplace = True)
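One caveat worth noting: `replace(' ', np.nan)` only matches cells that are exactly a single space. If any columns could hold multi-space or other whitespace-only strings, a regex replace catches them all. A sketch on toy data:

```python
import pandas as pd
import numpy as np

s = pd.Series(['a', ' ', '  '])
exact = s.replace(' ', np.nan)                    # only the single space becomes NaN
regex = s.replace(r'^\s*$', np.nan, regex=True)   # any whitespace-only string becomes NaN
print(exact.isna().sum(), regex.isna().sum())     # 1 2
```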
# Checking for total count and percentage of null values in all columns of the dataframe.
total = pd.DataFrame(main_df.isnull().sum().sort_values(ascending=False), columns=['Total'])
percentage = pd.DataFrame(round(100*(main_df.isnull().sum()/main_df.shape[0]),2).sort_values(ascending=False)\
,columns=['Percentage'])
pd.concat([total, percentage], axis = 1).head()
# Dropping rows with missing gmv; the column is too important to the analysis to be removed itself
main_df = main_df[~pd.isnull(main_df['gmv'])]
org_shape = main_df.shape
org_shape
# Checking for total count and percentage of null values in all columns of the dataframe.
total = pd.DataFrame(main_df.isnull().sum().sort_values(ascending=False), columns=['Total'])
percentage = pd.DataFrame(round(100*(main_df.isnull().sum()/main_df.shape[0]),2).sort_values(ascending=False)\
,columns=['Percentage'])
pd.concat([total, percentage], axis = 1).head()
No null values remain in the dataframe.
Verifying the datatypes, including that 'order_date' is datetime64:
main_df.dtypes
Identifying the data outside our analysis period - 01-Jul-2015 to 30-Jun-2016
main_df.loc[(main_df['order_date'].dt.floor("d") < '2015-07-01') | (main_df['order_date'].dt.floor("d") >= '2016-07-01')].shape
There are 592 records that fall outside our analysis period - 01-Jul-2015 to 30-Jun-2016
We will be dropping these records.
main_df = main_df.loc[(main_df['order_date'].dt.floor("d") >= '2015-07-01') & (main_df['order_date'].dt.floor("d") < '2016-07-01')]
main_df.shape
Max = pd.DataFrame(main_df[['order_date']].max().rename('Max'))
Min = pd.DataFrame(main_df[['order_date']].min().rename('Min'))
pd.concat([Min, Max], axis=1)
This verifies that our dataframe has data between July 2015 and June 2016 only.
# Creating a new column into a specific position in a DataFrame
loc_index = main_df.columns.get_loc('Month') + 1
main_df.insert(loc=loc_index,column='Week',value=main_df['order_date'].dt.strftime("%V"))
main_df['Year'] = main_df['Year'].astype('str')
main_df.head()
# Checking the combinations for any discrepancies
main_df.groupby(['Year', 'Week']).agg({'Month':"count"}).reset_index(drop=False)
# Updating the year to 2015 for rows with week 53 that are labelled 2016, for consistency with the other data sets
# Updating the month to 12 for those rows
# Dropping the few rows with week 27, as they belong to June 2015
main_df.loc[(main_df.Year == '2016') & (main_df.Week == '53'), 'Year'] = '2015'
main_df.loc[(main_df.Year == '2015') & (main_df.Week == '53'), 'Month'] = 12
main_df.drop(main_df[main_df['Week'] == '27'].index, inplace = True)
main_df.groupby(['Year', 'Week']).agg({'Month':"count"}).reset_index(drop=False)
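The week-53 adjustment above arises from ISO week numbering (which `strftime("%V")` follows): dates in early January can belong to the last ISO week of the previous year. A quick check with the standard library:

```python
from datetime import date

d = date(2016, 1, 1)
# 1 Jan 2016 falls in ISO week 53 of ISO year 2015
print(d.strftime("%V"), tuple(d.isocalendar())[:2])  # 53 (2015, 53)
```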
main_df.info()
# Dropping Columns with Single Value or all Different Values
count_df = pd.DataFrame(main_df.apply(lambda x: len(x.value_counts()), axis=0), columns=['Count'])
drop_columns = list(count_df.loc[(count_df['Count']==1) | (count_df['Count']==len(main_df.index))].index)
print('Dropping these columns => {}'.format(drop_columns))
main_df.drop(drop_columns, axis=1, inplace=True)
# Dropping Columns which are insignificant to the analysis
drop_columns = ['fsn_id', 'order_id', 'order_item_id', 'cust_id']
main_df.drop(drop_columns, axis=1, inplace=True)
main_df.head()
### List Price
List Price = GMV / Units
# Creating a new column into a specific position in a DataFrame
loc_index = main_df.columns.get_loc('gmv') + 1
main_df.insert(loc=loc_index,column='list_price',value = main_df['gmv'] / main_df['units'])
main_df.head()
Displaying the column values for a few orders when units > 1
main_df.loc[main_df['units'] != 1][['gmv','list_price','product_mrp','units']].head()
If the order date is near a payday in Ontario (the 1st and 15th of every month), we flag the column as 1, else 0.
main_df['payday_flag'] = main_df['order_date'].apply(lambda x: 1 if x.day in (14, 15, 16, 30, 31, 1, 2) else 0)
main_df.head()
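A note on extracting the day of the month here: `strftime('%d')` returns zero-padded strings such as '01' and '02', so string comparisons against '1' or '2' silently fail. Comparing `Timestamp.day` as an integer avoids this pitfall:

```python
import pandas as pd

d = pd.Timestamp('2015-07-01')
print(d.strftime('%d'))                       # '01' - zero-padded, so '1' would not match
print(d.day in (14, 15, 16, 30, 31, 1, 2))    # True - integer comparison is unambiguous
```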
If it is a holiday/occasion in Ontario, we flag the column as 1, else 0.
The following table lists all the holidays/occasions from 1st July, 2015 to 30th June, 2016**
| Occasion | Day |
|---|---|
| Canada Day | July 1, 2015 |
| Civic Holiday | August 3, 2015 |
| Labour Day | September 7, 2015 |
| Thanksgiving | October 12, 2015 |
| Halloween | October 31, 2015 |
| Remembrance Day | November 11, 2015 |
| Christmas Day | December 25, 2015 |
| Boxing Day | December 26, 2015 |
| New Year's Day | January 1, 2016 |
| Islander Day | February 15, 2016 |
| Louis Riel Day | February 15, 2016 |
| Heritage Day | February 15, 2016 |
| Family Day | February 15, 2016 |
| Valentine's Day | February 14, 2016 |
| Leap Day | February 29, 2016 |
| St. Patrick's Day | March 17, 2016 |
| Good Friday | March 25, 2016 |
| Easter Monday | March 28, 2016 |
| Mother's Day | May 8, 2016 |
| Victoria Day | May 23, 2016 |
| Father's Day | June 19, 2016 |
| Aboriginal Day | June 21, 2016 |
| St. Jean Baptiste Day | June 24, 2016 |
**reference link: https://www.statutoryholidays.com/
# Holiday/occasion dates in Ontario between July 2015 and June 2016 (see the table above)
holiday_dates = {
    '2015-07-01', '2015-08-03', '2015-09-07', '2015-10-12', '2015-10-31',
    '2015-11-11', '2015-12-25', '2015-12-26',
    '2016-01-01', '2016-02-14', '2016-02-15', '2016-02-29', '2016-03-17',
    '2016-03-25', '2016-03-28', '2016-05-08', '2016-05-23', '2016-06-19',
    '2016-06-21', '2016-06-24'
}
def holidayflg(ord_date):
    # Note: comparing strftime('%Y') (a string) against the int 2015 would always be False,
    # so the year must be included in the string comparison
    return 1 if ord_date.strftime('%Y-%m-%d') in holiday_dates else 0
main_df['occassion_flag'] = main_df['order_date'].apply(holidayflg)
main_df.head()
If the GMV value is above the 80th percentile (approximately 2450), the product is classified as luxury; otherwise as mass_market.
pd.DataFrame(main_df['gmv']).describe(percentiles=[.70,.80,.90]).T
main_df['gmv'].quantile(.8)
luxury_cutoff = main_df['gmv'].quantile(.8)
main_df['product_type'] = main_df['gmv'].apply(lambda x: 'luxury' if x >= luxury_cutoff else 'mass_market')
main_df.head()
# Creating a new column into a specific position in a DataFrame
col_loc = main_df.columns.get_loc('list_price') + 1
main_df.insert(loc=col_loc, column='Discount%', value = \
round(100*((main_df['product_mrp'] - main_df['list_price']) / main_df['product_mrp']),2))
main_df.head()
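To make the Discount% formula concrete, a quick worked example with hypothetical numbers:

```python
# Discount% = 100 * (MRP - list price) / MRP
mrp, list_price = 1000.0, 750.0
discount_pct = round(100 * (mrp - list_price) / mrp, 2)
print(discount_pct)  # 25.0
```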
main_df.describe().T
for col in main_df.describe().columns:
print('#############')
print(col)
print('#############')
main_df[col].hist()
plt.show()
cat_var = [cname for cname in main_df.columns if
main_df[cname].dtype == "object"]
main_df[cat_var].describe().T
main_df.head()
main_df.dtypes
# List of numeric variables to inspect
numeric_variables=['gmv','list_price','Discount%','deliverybdays','deliverycdays','sla','product_mrp','product_procurement_sla']
# Function to plot box plots for the numeric variable list
def univariate_continuos(var_list):
plt.figure(figsize=(12,6))
for var in var_list:
plt.subplot(2,4,var_list.index(var)+1)
#plt.boxplot(country[var])
sns.boxplot(y=var,palette='cubehelix', data=main_df)
# Automatically adjust subplot params so that the subplots fit into the figure area.
plt.tight_layout()
# display the plot
plt.show()
univariate_continuos(numeric_variables)
There seem to be many outliers in these columns. However, not all outliers are random, so we must be careful about which ones to drop.
# Checking outliers at 25%,50%,75%,90%,95% and 99%
main_df[numeric_variables].describe(percentiles=[.25,.5,.75,.90,.95,.99]).T
'gmv', 'list_price' and 'product_mrp' can have occasional outliers, since high-value sales do occur from time to time; these records are infrequent but not random. 'Discount%' can also be higher for certain products - specifically, a discount of 100% would mean the product came free with another product. 'deliverybdays', 'deliverycdays' and 'sla', however, appear to contain erroneous data, so we will treat their outlier values so that they do not distort the predictive model while still leaving enough data to build a generalisable model.
main_df[['deliverybdays', 'deliverycdays' ,'sla']].quantile([.95, .99])
Let us treat any observation above the 99th percentile of a variable as an outlier for that variable.
print(round(100*(main_df.loc[main_df['deliverybdays'] > 8.0].shape[0] / main_df.shape[0]),4))
print(round(100*(main_df.loc[main_df['deliverycdays'] > 10.0].shape[0] / main_df.shape[0]),4))
### sla
print(round(100*(main_df.loc[main_df['sla'] > 13.0].shape[0] / main_df.shape[0]),4))
For the variables 'sla', 'deliverybdays', 'deliverycdays', 'gmv', 'product_mrp' and 'list_price', where outliers are present, we will CAP values above the 99th percentile at the 99th-percentile value.
# Capping the outlier values at the pre-decided percentiles
# (the second line originally re-capped deliverybdays instead of deliverycdays)
main_df.loc[main_df['deliverybdays'] > 8.0, 'deliverybdays'] = 8.0
main_df.loc[main_df['deliverycdays'] > 10.0, 'deliverycdays'] = 10.0
main_df.loc[main_df['sla'] > 13.0, 'sla'] = 13.0
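An equivalent, more concise way to cap a column at its 99th percentile is `Series.clip`. A sketch on toy data (hypothetical values):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 100.0])
cap = s.quantile(0.99)        # the 99th-percentile value
capped = s.clip(upper=cap)    # all values above the cap are pulled down to it
print(capped.max() <= cap)    # True
```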
print(round(100*(main_df.shape[0] / initial_shape[0])))
So 89% of the records have been retained after cleaning and outlier treatment.
# Checking outliers at 25%,50%,75%,90%,95% and 99%
main_df.describe(percentiles=[.25,.5,.75,.90,.95,.99]).T
univariate_continuos(numeric_variables)
# Checking the skewness in the dataset
main_df[numeric_variables].skew()
num_col = ['gmv','deliverybdays','deliverycdays','list_price','product_mrp','product_procurement_sla']
#Function to plot the distribution plot of the numeric variable list
def distplots(var_list):
plt.figure(figsize=(15,6))
for var in var_list:
plt.subplot(2,3,var_list.index(var)+1)
sns.distplot(main_df[var], fit=norm, kde=False, hist=False)
#sns.boxplot(y=var,palette='cubehelix', data=main_df)
# Automatically adjust subplot params so that the subplots fit into the figure area.
plt.tight_layout()
# display the plot
plt.show()
distplots(num_col)
main_df.isnull().values.any()
The function below takes a year as input and generates the week numbers for each month of that year.
import calendar
def WeekFinderFromYear(year):
""" will return all the week from selected year """
import datetime
WEEK = {'MONDAY':0,'TUESDAY':1,'WEDNESDAY':2,'THURSDAY':3,'FRIDAY':4,'SATURDAY':5,'SUNDAY':6}
MONTH = {'JANUARY':1,'FEBRUARY':2,'MARCH':3,'APRIL':4,'MAY':5,'JUNE':6,'JULY':7,'AUGUST':8,'SEPTEMBER':9,'OCTOBER':10,'NOVEMBER':11,'DECEMBER':12}
year = int(year)
month = MONTH['JANUARY']
day = WEEK['MONDAY']
dt = datetime.date(year, month, 1)
dow_lst = []
while dt.weekday() != day:
dt = dt + datetime.timedelta(days = 1)
lst_month = MONTH.values()
for mont in lst_month:
while dt.month == mont:
dow_lst.append(dt)
dt = dt + datetime.timedelta(days=7)
month_number_week = {1:[], 2:[], 3:[], 4:[], 5:[], 6:[], 7:[], 8:[], 9:[], 10:[], 11:[], 12:[]}
month_name_week = {'Jan':[], 'Feb':[], 'Mar':[], 'Apr':[], 'May':[], 'Jun':[], 'Jul':[], 'Aug':[], 'Sep':[], 'Oct':[], 'Nov':[], 'Dec':[]}
for each in dow_lst:
month_number_week[each.month].append(each.isocalendar()[1])
month_name_week[calendar.month_abbr[each.month]].append(each.isocalendar()[1])
return month_number_week, month_name_week
dict_number_2015, dict_name_2015 = WeekFinderFromYear(2015)
print(dict_number_2015)
print()
print(dict_name_2015)
print()
dict_number_2016, dict_name_2016 = WeekFinderFromYear(2016)
print(dict_number_2016)
print()
print(dict_name_2016)
num_weeks_2015 = 0
num_weeks_2016 = 0
for i in dict_number_2015:
if i >= 7:
num_weeks_2015 += len(dict_number_2015[i])
for i in dict_number_2016:
if i <= 6:
num_weeks_2016 += len(dict_number_2016[i])
total_weeks = num_weeks_2015 + num_weeks_2016
total_weeks
As expected, there are a total of 52 weeks in our dataset.
We will generate weekly data from Year and Month for the media investment data
media_investment.head()
# Imputing nulls with 0
media_investment['Radio'].fillna(value=0, inplace=True)
media_investment['Other'].fillna(value=0, inplace=True)
media_investment.head()
# Temp DataFrame
temp_media_investment = pd.DataFrame(index=range(total_weeks), columns=['Year', 'Month', 'Week', 'Total Investment', 'TV', 'Digital', 'Sponsorship', 'Content Marketing', 'Online marketing', 'Affiliates', 'SEM', 'Radio', 'Other'])
temp_media_investment.head()
We divide each monthly value by the number of weeks in that month and use the result as the weekly data.
i = 0
spend_cols = ['Total Investment', 'TV', 'Digital', 'Sponsorship', 'Content Marketing', 'Online marketing', 'SEM', 'Radio', 'Other']
for index, row in media_investment.iterrows():
    week_dict = dict_number_2015 if row.Year == 2015 else dict_number_2016
    num_weeks = len(week_dict[row.Month])
    for week in week_dict[row.Month]:
        temp_media_investment.loc[i, 'Year'] = row.Year
        temp_media_investment.loc[i, 'Month'] = row.Month
        temp_media_investment.loc[i, 'Week'] = week
        for col in spend_cols:
            temp_media_investment.loc[i, col] = round(row[col] / num_weeks, 3)
        # Note: the source column name carries a leading space (' Affiliates')
        temp_media_investment.loc[i, 'Affiliates'] = round(row[' Affiliates'] / num_weeks, 3)
        i += 1
temp_media_investment.head()
temp_media_investment['Week'] = temp_media_investment['Week'].astype('str')
media_investment = temp_media_investment
media_investment.head()
# Checking for duplicates
media_investment.duplicated('Week').value_counts()
No duplicates
# Checking for nulls
media_investment.isnull().values.any()
No Nulls
# Dropping Year and Month columns as we won't be needing them anymore
del media_investment['Year']
del media_investment['Month']
media_investment.head()
original_col = media_investment.columns[1:]
original_col
def EMA_variables(df, var, n):
    # Insert an exponential-moving-average column next to each original column
    # (np.str was removed from NumPy; the builtin str is used instead)
    for i in var:
        loc_index = df.columns.get_loc(i) + 1
        df.insert(loc=loc_index, column=i + '_EMA_' + str(n), value=df[i].ewm(span=n, adjust=False).mean())
    return df
media_investment = EMA_variables(media_investment,original_col,8)
media_investment.head()
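For reference, `ewm(span=n, adjust=False)` implements the recursion y_t = (1 - alpha) * y_{t-1} + alpha * x_t with alpha = 2/(n+1), so the EMA columns can be checked by hand on toy data:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])
alpha = 2 / (3 + 1)                      # span=3  ->  alpha=0.5
ema = s.ewm(span=3, adjust=False).mean()
manual = [s[0]]
for x in s[1:]:
    manual.append((1 - alpha) * manual[-1] + alpha * x)
print(list(ema), manual)                 # both [1.0, 1.5, 2.25]
```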
def SMA_variables(df, var, n):
    # Insert a simple-moving-average column next to each original column
    for i in var:
        loc_index = df.columns.get_loc(i) + 1
        df.insert(loc=loc_index, column=i + '_SMA_' + str(n), value=df[i].rolling(window=n).mean())
    return df
media_investment = SMA_variables(media_investment,original_col,5)
media_investment.head()
media_investment = SMA_variables(media_investment,original_col,3)
media_investment.head()
def calculate_ad_stocks(data, engagement_factor):
    """Calculate adstock for each media channel and insert it next to that
    channel's EMA column. The adstock formula is: At = Xt + adstock rate * At-1."""
    channels = ['TV', 'Digital', 'Sponsorship', 'Content Marketing', 'Online marketing',
                'Affiliates', 'SEM', 'Radio', 'Other', 'Total Investment']
    for channel in channels:
        ad_stock_value = 0
        ad_stock_vector = []
        # Loop through the weekly spend to calculate the adstock values
        for spend in data[channel]:
            ad_stock_value = spend + engagement_factor * ad_stock_value
            ad_stock_vector.append(ad_stock_value)
        # Add the adstock vector to the dataset, right after the channel's EMA column
        loc_index = data.columns.get_loc(channel + '_EMA_8') + 1
        data.insert(loc=loc_index, column=channel.replace(' ', '_') + '_Ad_Stock', value=ad_stock_vector)
    return data
media_investment = calculate_ad_stocks(data=media_investment, engagement_factor=0.6)
media_investment.head(10)
media_investment.fillna(value=0, inplace=True)
media_investment.head(10)
media_investment.shape
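The ad-stock columns built above follow a simple geometric-decay recursion: this week's stock equals this week's spend plus the engagement factor (0.6 here) times last week's stock. A minimal sketch of that recursion with made-up spend figures:

```python
def ad_stock(spend, engagement_factor=0.6):
    """Geometric-decay ad stock: stock_t = spend_t + factor * stock_{t-1}."""
    stock, out = 0.0, []
    for s in spend:
        stock = s + engagement_factor * stock
        out.append(stock)
    return out

# Spend stops after week 2, yet the stock decays gradually rather than dropping to zero.
print(ad_stock([100, 50, 0, 0]))  # approximately [100.0, 110.0, 66.0, 39.6]
```

This carry-over is what lets past advertising keep influencing current sales in the model.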
net_promoter_score
# resetting index
net_promoter_score.reset_index(drop=True, inplace=True)
# Transposing the dataframe
net_promoter_score = net_promoter_score.T
# resetting index
net_promoter_score.reset_index(drop=False, inplace=True)
# renaming columns
net_promoter_score.columns = ['Month', 'NPS', 'Stock Index']
net_promoter_score.drop(net_promoter_score.index[0], inplace=True)
net_promoter_score
# Temp DataFrame
temp_nps = pd.DataFrame(index=range(total_weeks), columns=['Year', 'Month', 'Week', 'NPS', 'Stock Index'])
temp_nps
Since NPS and the Stock Index are only available at monthly granularity, every week of a month is assigned that month's NPS score and Stock Index.
i = 0
for index, row in net_promoter_score.iterrows():
if '15' in row['Month']:
month = re.split("[']", row['Month'])[0][0:3]
for week in dict_name_2015[month]:
temp_nps.iloc[i].Year = 2015
temp_nps.iloc[i].Month = month
temp_nps.iloc[i].Week = week
temp_nps.iloc[i].NPS = row['NPS']
temp_nps.iloc[i]['Stock Index'] = row['Stock Index']
i+=1
elif '16' in row['Month']:
month = re.split("[']", row['Month'])[0][0:3]
for week in dict_name_2016[month]:
temp_nps.iloc[i].Year = 2016
temp_nps.iloc[i].Month = month
temp_nps.iloc[i].Week = week
temp_nps.iloc[i].NPS = row['NPS']
temp_nps.iloc[i]['Stock Index'] = row['Stock Index']
i+=1
temp_nps.head()
temp_nps['Week'] = temp_nps['Week'].astype('str')
net_promoter_score = temp_nps
net_promoter_score
net_promoter_score.duplicated('Week').value_counts()
No duplicates
net_promoter_score.isnull().values.any()
No Nulls
# Dropping Year and Month columns as we won't be needing them anymore
del net_promoter_score['Year']
del net_promoter_score['Month']
net_promoter_score.head()
nps_original_col = net_promoter_score.columns[1:]
nps_original_col
net_promoter_score = SMA_variables(net_promoter_score,nps_original_col,5)
net_promoter_score.head()
net_promoter_score = SMA_variables(net_promoter_score,nps_original_col,3)
net_promoter_score.head()
net_promoter_score.fillna(value=0, inplace=True)
net_promoter_score.head()
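The expansion loop above broadcasts one monthly reading to every week of that month. The same idea in a compact sketch (the week lists and NPS values below are illustrative, not taken from the data):

```python
# Each month's single NPS reading is repeated for every week in that month.
month_weeks = {'Jul': [28, 29, 30, 31], 'Aug': [32, 33, 34, 35]}  # illustrative mapping
monthly_nps = {'Jul': 47.0, 'Aug': 54.1}                          # illustrative values

weekly_nps = {week: monthly_nps[month]
              for month, weeks in month_weeks.items()
              for week in weeks}
print(weekly_nps[29], weekly_nps[33])  # 47.0 54.1
```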
We will generate weekly data from Year and Month for the sale calendar data
sale_calendar
# Getting the data into required format
sale_calendar.columns = ['Year', 'Sale']
sale_calendar.Year = sale_calendar.Year.apply(lambda x: int(x))
sale_calendar.Sale = sale_calendar.Sale.apply(lambda x: x.replace('th','').strip())
sale_calendar.Sale = sale_calendar.Sale.apply(lambda x: x.replace('rd','').strip())
sale_calendar.Sale = sale_calendar.Sale.apply(lambda x: x.replace('(','').strip())
sale_calendar.Sale = sale_calendar.Sale.apply(lambda x: x.replace(')','').strip())
# Splitting 25-31 Dec into two different rows for the ease of weekly data generation
sale_calendar.iloc[5, sale_calendar.columns.get_loc('Sale')] = '25-31 Dec'
sale_calendar = pd.concat([sale_calendar, pd.DataFrame([{'Year': 2016, 'Sale': '1-3 Jan'}])], ignore_index=True)
sale_calendar.sort_values(by=['Year'], ascending = True, inplace = True)
sale_calendar
# Dropping '17-15 Oct' entry assuming it is erroneous
sale_calendar.drop(sale_calendar.index[3], inplace=True)
sale_calendar
# Temp DataFrame
temp_sale_calendar = pd.DataFrame(index=range(total_weeks), columns=['Year', 'Month', 'Week', 'Sale'])
temp_sale_calendar
%V - ISO 8601 week as a decimal number with Monday as the first day of the week.
We find the number of days in each week of a month on which there was a sale
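One caveat worth noting: `%V` support in `strftime` is platform-dependent, whereas `date.isocalendar()` returns the same ISO week number portably. A small check, which also shows why week 53 straddles the calendar-year boundary:

```python
import datetime

d = datetime.date(2015, 12, 28)          # a Monday, start of ISO week 53 of 2015
iso_year, iso_week, iso_weekday = d.isocalendar()
print(iso_year, iso_week, iso_weekday)   # 2015 53 1

# 1 Jan 2016 still belongs to ISO week 53 of 2015 - the reason week 53
# spans both calendar years and needs special handling below.
print(datetime.date(2016, 1, 1).isocalendar()[1])  # 53
```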
i = 0
import datetime
for index, row in sale_calendar.iterrows():
week_list = []
month_list = []
year_list = []
date1, date2, month = re.split("[- ]", row.Sale)
month = month[0:3] # Taking only the first 3 character of the month name
year = row.Year
# print("Date => {}-{}-{}".format(date1, month, year))
# print(datetime.datetime.strptime('{}-{}-{}'.format(date1, month, year), "%d-%b-%Y").strftime("%V"))
# print("Date => {}-{}-{}".format(date2, month, year))
# print(datetime.datetime.strptime('{}-{}-{}'.format(date2, month, year), "%d-%b-%Y").strftime("%V"))
date = int(date1)
while date <= int(date2):
# Extracting the week numbers of the range of dates in each month
week = datetime.datetime.strptime('{}-{}-{}'.format(date, month, year), "%d-%b-%Y").strftime("%V")
week_list.append(int(week))
date+=1
week_dict = dict.fromkeys(week_list, 0) # Generating a week list dictionary with default value as 0
for j in week_list:
week_dict[j] = week_dict[j] + 1
for key, value in week_dict.items():
temp_sale_calendar.iloc[i].Year = year
temp_sale_calendar.iloc[i].Month = month
temp_sale_calendar.iloc[i].Week = key
temp_sale_calendar.iloc[i].Sale = value
i+=1
temp_sale_calendar.dropna(axis=0, how='all', inplace=True) # Drop rows where all columns are null
temp_sale_calendar.sort_values(by=['Week'], ascending = True, inplace = True)
# Handling the special cases of 53 and 7 week numbers which appear twice
temp_sale_calendar.loc[temp_sale_calendar['Week'] == 53, 'Sale'] = temp_sale_calendar.loc[temp_sale_calendar['Week'] == 53, 'Sale'].sum()
temp_sale_calendar.loc[temp_sale_calendar['Week'] == 7, 'Sale'] = temp_sale_calendar.loc[temp_sale_calendar['Week'] == 7, 'Sale'].sum()
# Dropping the redundant 53 and 7 week numbers
temp_sale_calendar.drop_duplicates(subset=['Week'], keep='first', inplace=True)
temp_sale_calendar.reset_index(drop=True, inplace=True)
temp_sale_calendar.head()
temp_sale_calendar['Week'] = temp_sale_calendar['Week'].astype('str')
sale_calendar = temp_sale_calendar
sale_calendar.head()
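The week-counting loop above tallies, for each promotion date range, how many of its days land in each ISO week. The same tally can be sketched with `collections.Counter` (the function name below is ours, not from the notebook):

```python
import datetime
from collections import Counter

def sale_days_per_week(start, end):
    """Count how many days of the closed range [start, end] fall in each ISO week."""
    n_days = (end - start).days + 1
    weeks = ((start + datetime.timedelta(days=n)).isocalendar()[1] for n in range(n_days))
    return dict(Counter(weeks))

# The 25-31 Dec 2015 sale: 3 days fall in ISO week 52, 4 in week 53.
print(sale_days_per_week(datetime.date(2015, 12, 25), datetime.date(2015, 12, 31)))  # {52: 3, 53: 4}
```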
sale_calendar.duplicated('Week').value_counts()
No duplicates
sale_calendar.isnull().values.any()
No Nulls
# Dropping Year and Month columns as we won't be needing them anymore
del sale_calendar['Year']
del sale_calendar['Month']
We will generate weekly data from date for the climate data of 2015
# Skipping the first 24 rows while reading the data to ignore the metadata
climate_2015 = pd.read_csv('ONTARIO-2015.csv', skiprows=24)
climate_2015.head()
climate_2015.columns
# Dropping the columns with all nulls
climate_2015.dropna(axis=1, thresh=1, inplace=True)
climate_2015.reset_index(drop=True, inplace=True)
climate_2015.head()
climate_2015.columns
# Dropping columns with only one distinct value
drop_cols = []
for col in climate_2015.columns:
if 'Flag' in col:
print(climate_2015[col].value_counts())
print()
drop_cols.append(col)
drop_cols.append('Data Quality')
climate_2015.drop(drop_cols, axis=1, inplace=True)
print(climate_2015.columns)
# Formatting column names
climate_2015.columns = [col.replace(' (°C)','').strip() for col in climate_2015.columns]
climate_2015.columns
climate_2015.info()
# Dropping rows with nulls in any of the measurement columns
cols = ['Max Temp',
'Min Temp',
'Mean Temp',
'Heat Deg Days',
'Cool Deg Days',
'Total Rain (mm)',
'Total Snow (cm)',
'Total Precip (mm)',
'Snow on Grnd (cm)']
climate_2015.dropna(subset=cols, inplace=True)
climate_2015.reset_index(drop=True, inplace=True)
climate_2015.info()
# Extracting week # from the date field and formatting it
climate_2015['Week'] = climate_2015['Date/Time'].apply(lambda x: datetime.datetime.strptime(x, "%Y-%m-%d").strftime("%V"))
climate_2015['Week'] = climate_2015['Week'].apply(lambda x: str(x).lstrip('0'))
climate_2015.head()
# Extracting July to Dec month's data
climate_2015 = climate_2015[climate_2015['Month'] >= 7]
climate_2015.reset_index(drop=True, inplace=True)
climate_2015.head()
climate_2015.sort_values(by=['Week'], ascending = True, inplace = True)
# Getting weekly data
climate_2015 = climate_2015.groupby(['Week']).agg({'Max Temp':"max", 'Min Temp':"min", 'Mean Temp':"mean", 'Heat Deg Days':"mean", 'Cool Deg Days':"mean", 'Total Rain (mm)':"mean", 'Total Snow (cm)':"mean", 'Total Precip (mm)':"mean", 'Snow on Grnd (cm)':"mean"}).reset_index(drop=False)
climate_2015.head()
# Dropping 27th week data as it belongs to June 2015
climate_2015.drop(climate_2015[climate_2015['Week'] == '27'].index, inplace = True)
climate_2015.duplicated().any()
No duplicates
climate_2015.isnull().values.any()
No Nulls
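The weekly climate rollup above uses a dict-style `.agg`. pandas (0.25+) named aggregation expresses the same thing with explicit output column names; a sketch with toy readings:

```python
import pandas as pd

daily = pd.DataFrame({
    'Week':      ['28', '28', '29', '29'],   # toy daily readings
    'Max Temp':  [30.1, 28.4, 25.0, 26.2],
    'Mean Temp': [22.0, 21.0, 18.0, 19.0],
})

# Named aggregation: output_column = (input column, aggregation function)
weekly = daily.groupby('Week').agg(
    max_temp=('Max Temp', 'max'),
    mean_temp=('Mean Temp', 'mean'),
).reset_index()
print(weekly)
```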
We will generate weekly data from date for the climate data of 2016
# Skipping the first 24 rows while reading the data to ignore the metadata
climate_2016 = pd.read_csv('ONTARIO-2016.csv', skiprows=24)
climate_2016.head()
climate_2016.columns
# Dropping the columns with all nulls
climate_2016.dropna(axis=1, thresh=1, inplace=True)
climate_2016.reset_index(drop=True, inplace=True)
climate_2016.head()
climate_2016.columns
# Dropping columns with only one distinct value
drop_cols = []
for col in climate_2016.columns:
if 'Flag' in col:
print(climate_2016[col].value_counts())
print()
drop_cols.append(col)
drop_cols.append('Data Quality')
climate_2016.drop(drop_cols, axis=1, inplace=True)
print(climate_2016.columns)
# Formatting column names
climate_2016.columns = [col.replace(' (°C)','').strip() for col in climate_2016.columns]
climate_2016.columns
climate_2016.info()
# Dropping rows with nulls in any of the measurement columns
cols = ['Max Temp',
'Min Temp',
'Mean Temp',
'Heat Deg Days',
'Cool Deg Days',
'Total Rain (mm)',
'Total Snow (cm)',
'Total Precip (mm)',
'Snow on Grnd (cm)']
climate_2016.dropna(subset=cols, inplace=True)
climate_2016.reset_index(drop=True, inplace=True)
climate_2016.info()
# Extracting week # from the date field and formatting it
climate_2016['Week'] = climate_2016['Date/Time'].apply(lambda x: datetime.datetime.strptime(x, "%Y-%m-%d").strftime("%V"))
climate_2016['Week'] = climate_2016['Week'].apply(lambda x: str(x).lstrip('0'))
climate_2016.head()
# Extracting Jan to June month's data
climate_2016 = climate_2016[climate_2016['Month'] <= 6]
climate_2016.reset_index(drop=True, inplace=True)
climate_2016.head()
climate_2016.tail()
climate_2016.sort_values(by=['Week'], ascending = True, inplace = True)
# Getting weekly data
climate_2016 = climate_2016.groupby(['Week']).agg({'Max Temp':"max", 'Min Temp':"min", 'Mean Temp':"mean", 'Heat Deg Days':"mean", 'Cool Deg Days':"mean", 'Total Rain (mm)':"mean", 'Total Snow (cm)':"mean", 'Total Precip (mm)':"mean", 'Snow on Grnd (cm)':"mean"}).reset_index(drop=False)
climate_2016.head()
climate_2016.duplicated().any()
No duplicates
climate_2016.isnull().values.any()
No Nulls
def list_diff(list1, list2):
return (list(set(list1) - set(list2)))
print(list_diff(list(climate_2015.columns), list(climate_2016.columns)))
print(climate_2015.shape)
print(climate_2016.shape)
Concatenating the climate data sets to form a single one to be merged with the order data set
climate = pd.concat([climate_2015, climate_2016], axis = 0)
climate.shape
climate.info()
climate.head()
climate.Week.unique()
main_df.shape
main_df['product_analytic_sub_category'].value_counts()
cameraaccessory_df = main_df.loc[main_df['product_analytic_sub_category']=='cameraaccessory']
gamingaccessory_df = main_df.loc[main_df['product_analytic_sub_category']=='gamingaccessory']
homeaudio_df = main_df.loc[main_df['product_analytic_sub_category']=='homeaudio']
print('No of rows in cameraaccessory_df: {}'.format(cameraaccessory_df.shape[0]))
print('No of rows in gamingaccessory_df: {}'.format(gamingaccessory_df.shape[0]))
print('No of rows in homeaudio_df: {}'.format(homeaudio_df.shape[0]))
cameraaccessory_df['is_cod'] = cameraaccessory_df['s1_fact.order_payment_type'].apply(lambda x:1 if x=='cod' else 0)
gamingaccessory_df['is_cod'] = gamingaccessory_df['s1_fact.order_payment_type'].apply(lambda x:1 if x=='cod' else 0)
homeaudio_df['is_cod'] = homeaudio_df['s1_fact.order_payment_type'].apply(lambda x:1 if x=='cod' else 0)
cameraaccessory_df['is_mass_market'] = cameraaccessory_df['product_type'].apply(lambda x:1 if x=='mass_market' else 0)
gamingaccessory_df['is_mass_market'] = gamingaccessory_df['product_type'].apply(lambda x:1 if x=='mass_market' else 0)
homeaudio_df['is_mass_market'] = homeaudio_df['product_type'].apply(lambda x:1 if x=='mass_market' else 0)
cameraaccessory_df.head()
Deleting the columns 's1_fact.order_payment_type' & 'product_type'
# Dropping Columns which are insignificant to the analysis
drop_columns = ['s1_fact.order_payment_type','product_type']
cameraaccessory_df.drop(drop_columns, axis=1, inplace=True)
gamingaccessory_df.drop(drop_columns, axis=1, inplace=True)
homeaudio_df.drop(drop_columns, axis=1, inplace=True)
product_analytic_vertical has multiple levels, so we create dummy features (one-hot encoded) from it
# Creating dummy variables for the remaining categorical variable
dummy1 = pd.get_dummies(cameraaccessory_df[['product_analytic_vertical']], prefix='product_vertical', drop_first=True)
dummy2 = pd.get_dummies(gamingaccessory_df[['product_analytic_vertical']], prefix='product_vertical', drop_first=True)
dummy3 = pd.get_dummies(homeaudio_df[['product_analytic_vertical']], prefix='product_vertical', drop_first=True)
# Adding the results to the original dataframes
cameraaccessory_df = pd.concat([cameraaccessory_df, dummy1], axis=1)
gamingaccessory_df = pd.concat([gamingaccessory_df, dummy2], axis=1)
homeaudio_df = pd.concat([homeaudio_df, dummy3], axis=1)
# removing columns
cameraaccessory_df = cameraaccessory_df.drop('product_analytic_vertical', axis=1)
gamingaccessory_df = gamingaccessory_df.drop('product_analytic_vertical', axis=1)
homeaudio_df = homeaudio_df.drop('product_analytic_vertical', axis=1)
print('Shape of cameraaccessory_df: {}'.format(cameraaccessory_df.shape))
print('Shape of gamingaccessory_df: {}'.format(gamingaccessory_df.shape))
print('Shape of homeaudio_df: {}'.format(homeaudio_df.shape))
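A toy illustration of what `get_dummies` with `drop_first=True` produces (the vertical names below are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({'product_analytic_vertical': ['lens', 'flash', 'lens', 'strap']})

# drop_first=True drops the alphabetically first level ('flash') as the baseline,
# avoiding a redundant, perfectly collinear column.
dummies = pd.get_dummies(df[['product_analytic_vertical']],
                         prefix='product_vertical', drop_first=True)
print(list(dummies.columns))  # ['product_vertical_lens', 'product_vertical_strap']
```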
# Unique value frequencies
unique_values = pd.DataFrame(cameraaccessory_df.apply(lambda x: len(x.value_counts(dropna=False)), axis=0), \
columns=['Unique Value Count']).sort_values(by='Unique Value Count', ascending=True)
unique_values['dtype'] = pd.DataFrame(cameraaccessory_df.dtypes)
unique_values.head()
# Unique value frequencies
unique_values = pd.DataFrame(gamingaccessory_df.apply(lambda x: len(x.value_counts(dropna=False)), axis=0), \
columns=['Unique Value Count']).sort_values(by='Unique Value Count', ascending=True)
unique_values['dtype'] = pd.DataFrame(gamingaccessory_df.dtypes)
unique_values.head()
# Unique value frequencies
unique_values = pd.DataFrame(homeaudio_df.apply(lambda x: len(x.value_counts(dropna=False)), axis=0), \
columns=['Unique Value Count']).sort_values(by='Unique Value Count', ascending=True)
unique_values['dtype'] = pd.DataFrame(homeaudio_df.dtypes)
unique_values.head()
Thus we see that in all 3 dfs, columns 'product_analytic_category' & 'product_analytic_sub_category' have only 1 unique value. Hence we will drop these 2 columns from the 3 dataframes.
# Dropping Columns which are insignificant to the analysis
drop_columns = ['product_analytic_category', 'product_analytic_sub_category']
cameraaccessory_df.drop(drop_columns, axis=1, inplace=True)
gamingaccessory_df.drop(drop_columns, axis=1, inplace=True)
homeaudio_df.drop(drop_columns, axis=1, inplace=True)
print('Are there any null values in cameraaccessory_df: {}'.format(cameraaccessory_df.isnull().values.any()))
print('Are there any null values in gamingaccessory_df: {}'.format(gamingaccessory_df.isnull().values.any()))
print('Are there any null values in homeaudio_df: {}'.format(homeaudio_df.isnull().values.any()))
Rolling up the order data to weekly granularity: revenue, flags, and counts are summed per week, while rates such as Discount% and the SLA columns are averaged
We drop list_price and product_mrp. These features are closely related to gmv (revenue), but they don't add business value: it is obvious that a higher MRP or list price inflates revenue, and raising them is not a sound strategy for maximising it.¶Similarly, units is not an important feature from a business perspective, since more units sold trivially means more revenue. The modelling challenge is to find the attributes which, when tuned, make customers buy more units and thereby increase revenue.¶cameraaccessory_df = cameraaccessory_df.groupby(['Week']).agg({'gmv':"sum",'Discount%':'mean','deliverybdays': "mean", \
'deliverycdays':'mean','sla':'mean', 'product_procurement_sla':'mean', \
'payday_flag':'sum','occassion_flag':'sum','is_cod':'sum', 'is_mass_market':'sum', \
'product_vertical_cameraaccessory':'sum', \
'product_vertical_camerabag':'sum', 'product_vertical_camerabattery':'sum', \
'product_vertical_camerabatterycharger':'sum', 'product_vertical_camerabatterygrip': \
'sum','product_vertical_cameraeyecup':'sum','product_vertical_camerafilmrolls':'sum', \
'product_vertical_camerahousing':'sum','product_vertical_cameraledlight':'sum', \
'product_vertical_cameramicrophone':'sum','product_vertical_cameramount':'sum', \
'product_vertical_cameraremotecontrol':'sum', 'product_vertical_cameratripod':'sum', \
'product_vertical_extensiontube':'sum', 'product_vertical_filter':'sum', \
'product_vertical_flash':'sum','product_vertical_flashshoeadapter':'sum', \
'product_vertical_lens':'sum','product_vertical_reflectorumbrella':'sum', \
'product_vertical_softbox':'sum','product_vertical_strap':'sum', \
'product_vertical_teleconverter':'sum','product_vertical_telescope':'sum'}) \
.reset_index(drop=False)
cameraaccessory_df.shape
gamingaccessory_df = gamingaccessory_df.groupby(['Week']).agg({'gmv':"sum",'Discount%':'mean','deliverybdays':"mean", \
'deliverycdays':'mean','sla':'mean','product_procurement_sla': \
'mean','payday_flag':'sum','occassion_flag':'sum','is_cod':'sum', \
'is_mass_market':'sum','product_vertical_gamecontrolmount':'sum', \
'product_vertical_gamepad':'sum', \
'product_vertical_gamingaccessorykit':'sum', \
'product_vertical_gamingadapter':'sum', \
'product_vertical_gamingchargingstation':'sum', \
'product_vertical_gamingheadset':'sum', \
'product_vertical_gamingkeyboard':'sum', \
'product_vertical_gamingmemorycard':'sum', \
'product_vertical_gamingmouse':'sum', \
'product_vertical_gamingmousepad':'sum', \
'product_vertical_gamingspeaker':'sum',
'product_vertical_joystickgamingwheel':'sum', \
'product_vertical_motioncontroller':'sum', \
'product_vertical_tvoutcableaccessory':'sum'}) \
.reset_index(drop=False)
gamingaccessory_df.shape
homeaudio_df = homeaudio_df.groupby(['Week']).agg({'gmv':"sum",'Discount%':'mean','deliverybdays': \
'mean','deliverycdays':'mean','sla':'mean','product_procurement_sla':'mean',\
'payday_flag':'sum','occassion_flag':'sum','is_cod':'sum', \
'is_mass_market':'sum','product_vertical_djcontroller':'sum', \
'product_vertical_dock':'sum', 'product_vertical_dockingstation':'sum', \
'product_vertical_fmradio':'sum', 'product_vertical_hifisystem':'sum', \
'product_vertical_homeaudiospeaker':'sum', \
'product_vertical_karaokeplayer':'sum', 'product_vertical_slingbox':'sum', \
'product_vertical_soundmixer':'sum','product_vertical_voicerecorder':'sum'})\
.reset_index(drop=False)
homeaudio_df.shape
cameraaccessory_df['payday_week'] = cameraaccessory_df['payday_flag'].apply(lambda x:1 if x > 0 else 0)
gamingaccessory_df['payday_week'] = gamingaccessory_df['payday_flag'].apply(lambda x:1 if x > 0 else 0)
homeaudio_df['payday_week'] = homeaudio_df['payday_flag'].apply(lambda x:1 if x > 0 else 0)
cameraaccessory_df['holiday_week'] = cameraaccessory_df['occassion_flag'].apply(lambda x:1 if x > 0 else 0)
gamingaccessory_df['holiday_week'] = gamingaccessory_df['occassion_flag'].apply(lambda x:1 if x > 0 else 0)
homeaudio_df['holiday_week'] = homeaudio_df['occassion_flag'].apply(lambda x:1 if x > 0 else 0)
cameraaccessory_df.head()
Dropping columns 'payday_flag' and 'occassion_flag'
drop_columns = ['payday_flag','occassion_flag']
cameraaccessory_df = cameraaccessory_df.drop(drop_columns, axis=1)
gamingaccessory_df = gamingaccessory_df.drop(drop_columns, axis=1)
homeaudio_df = homeaudio_df.drop(drop_columns, axis=1)
cameraaccessory_df.head()
print('Shape of cameraaccessory_df: {}'.format(cameraaccessory_df.shape))
print('Shape of gamingaccessory_df: {}'.format(gamingaccessory_df.shape))
print('Shape of homeaudio_df: {}'.format(homeaudio_df.shape))
print(cameraaccessory_df.Week.unique())
print("--------------------------------------------------------")
print(gamingaccessory_df.Week.unique())
print("--------------------------------------------------------")
print(homeaudio_df.Week.unique())
print("--------------------------------------------------------")
# Formatting pre merge
cameraaccessory_df.Week = cameraaccessory_df.Week.apply(lambda x: x.lstrip('0'))
gamingaccessory_df.Week = gamingaccessory_df.Week.apply(lambda x: x.lstrip('0'))
homeaudio_df.Week = homeaudio_df.Week.apply(lambda x: x.lstrip('0'))
# Taking a left join on the Week column
cameraaccessory_df = pd.merge(cameraaccessory_df, media_investment, how='left', on='Week')
gamingaccessory_df = pd.merge(gamingaccessory_df, media_investment, how='left', on='Week')
homeaudio_df = pd.merge(homeaudio_df, media_investment, how='left', on='Week')
print('Are there any null values in cameraaccessory_df: {}'.format(cameraaccessory_df.isnull().values.any()))
print('Are there any null values in gamingaccessory_df: {}'.format(gamingaccessory_df.isnull().values.any()))
print('Are there any null values in homeaudio_df: {}'.format(homeaudio_df.isnull().values.any()))
print('Shape of cameraaccessory_df: {}'.format(cameraaccessory_df.shape))
print('Shape of gamingaccessory_df: {}'.format(gamingaccessory_df.shape))
print('Shape of homeaudio_df: {}'.format(homeaudio_df.shape))
# Taking a left join on the Week column
cameraaccessory_df = pd.merge(cameraaccessory_df, net_promoter_score, how='left', on='Week')
gamingaccessory_df = pd.merge(gamingaccessory_df, net_promoter_score, how='left', on='Week')
homeaudio_df = pd.merge(homeaudio_df, net_promoter_score, how='left', on='Week')
print('Are there any null values in cameraaccessory_df: {}'.format(cameraaccessory_df.isnull().values.any()))
print('Are there any null values in gamingaccessory_df: {}'.format(gamingaccessory_df.isnull().values.any()))
print('Are there any null values in homeaudio_df: {}'.format(homeaudio_df.isnull().values.any()))
print('Shape of cameraaccessory_df: {}'.format(cameraaccessory_df.shape))
print('Shape of gamingaccessory_df: {}'.format(gamingaccessory_df.shape))
print('Shape of homeaudio_df: {}'.format(homeaudio_df.shape))
# Taking a left join on the Week column
cameraaccessory_df = pd.merge(cameraaccessory_df, climate, how='left', on='Week')
gamingaccessory_df = pd.merge(gamingaccessory_df, climate, how='left', on='Week')
homeaudio_df = pd.merge(homeaudio_df, climate, how='left', on='Week')
print('Are there any null values in cameraaccessory_df: {}'.format(cameraaccessory_df.isnull().values.any()))
print('Are there any null values in gamingaccessory_df: {}'.format(gamingaccessory_df.isnull().values.any()))
print('Are there any null values in homeaudio_df: {}'.format(homeaudio_df.isnull().values.any()))
As expected, there are now nulls in the climate columns for week 3, which doesn't exist in the climate dataset
print('No of rows with null values in cameraaccessory_df:{}'.format(cameraaccessory_df[cameraaccessory_df.isnull().any(axis=1)].shape))
print('No of rows with null values in gamingaccessory_df:{}'.format(gamingaccessory_df[gamingaccessory_df.isnull().any(axis=1)].shape))
print('No of rows with null values in homeaudio_df:{}'.format(homeaudio_df[homeaudio_df.isnull().any(axis=1)].shape))
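When a left join is expected to leave gaps, `pd.merge`'s `indicator=True` flag makes the unmatched keys easy to list. A sketch with hypothetical miniature frames mirroring the situation above:

```python
import pandas as pd

orders  = pd.DataFrame({'Week': ['1', '2', '3'], 'gmv': [100, 200, 300]})
climate = pd.DataFrame({'Week': ['1', '2'], 'Mean Temp': [5.0, 6.0]})

# indicator=True adds a '_merge' column flagging where each row was matched.
merged = pd.merge(orders, climate, how='left', on='Week', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Week'].tolist()
print(unmatched)  # ['3'] - the week with no climate record
```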
Since only one row has nulls, we will drop it
cols = ['Max Temp', 'Min Temp', 'Mean Temp', 'Heat Deg Days', 'Cool Deg Days', 'Total Rain (mm)', 'Total Snow (cm)', 'Total Precip (mm)', 'Snow on Grnd (cm)']
cameraaccessory_df.dropna(subset=cols, inplace=True)
gamingaccessory_df.dropna(subset=cols, inplace=True)
homeaudio_df.dropna(subset=cols, inplace=True)
print('Are there any null values in cameraaccessory_df: {}'.format(cameraaccessory_df.isnull().values.any()))
print('Are there any null values in gamingaccessory_df: {}'.format(gamingaccessory_df.isnull().values.any()))
print('Are there any null values in homeaudio_df: {}'.format(homeaudio_df.isnull().values.any()))
No Nulls
print('Shape of cameraaccessory_df: {}'.format(cameraaccessory_df.shape))
print('Shape of gamingaccessory_df: {}'.format(gamingaccessory_df.shape))
print('Shape of homeaudio_df: {}'.format(homeaudio_df.shape))
# Taking an left join on Week column
cameraaccessory_df = pd.merge(cameraaccessory_df, sale_calendar, how='left', on='Week')
gamingaccessory_df = pd.merge(gamingaccessory_df, sale_calendar, how='left', on='Week')
homeaudio_df = pd.merge(homeaudio_df, sale_calendar, how='left', on='Week')
print('Are there any null values in cameraaccessory_df: {}'.format(cameraaccessory_df.isnull().values.any()))
print('Are there any null values in gamingaccessory_df: {}'.format(gamingaccessory_df.isnull().values.any()))
print('Are there any null values in homeaudio_df: {}'.format(homeaudio_df.isnull().values.any()))
# Imputing the nulls with 0, meaning those weeks had no special-sale days
cameraaccessory_df['Sale'].fillna(value=0, inplace=True)
gamingaccessory_df['Sale'].fillna(value=0, inplace=True)
homeaudio_df['Sale'].fillna(value=0, inplace=True)
print('Are there any null values in cameraaccessory_df: {}'.format(cameraaccessory_df.isnull().values.any()))
print('Are there any null values in gamingaccessory_df: {}'.format(gamingaccessory_df.isnull().values.any()))
print('Are there any null values in homeaudio_df: {}'.format(homeaudio_df.isnull().values.any()))
No Nulls
print('Shape of cameraaccessory_df: {}'.format(cameraaccessory_df.shape))
print('Shape of gamingaccessory_df: {}'.format(gamingaccessory_df.shape))
print('Shape of homeaudio_df: {}'.format(homeaudio_df.shape))
cameraaccessory_df.head()
# String to Int
cameraaccessory_df['Week'] = cameraaccessory_df['Week'].astype('int64')
gamingaccessory_df['Week'] = gamingaccessory_df['Week'].astype('int64')
homeaudio_df['Week'] = homeaudio_df['Week'].astype('int64')
cameraaccessory_df1 = cameraaccessory_df.loc[cameraaccessory_df['Week'] >= 28]
cameraaccessory_df2 = cameraaccessory_df.loc[cameraaccessory_df['Week'] < 28]
gamingaccessory_df1 = gamingaccessory_df.loc[gamingaccessory_df['Week'] >= 28]
gamingaccessory_df2 = gamingaccessory_df.loc[gamingaccessory_df['Week'] < 28]
homeaudio_df1 = homeaudio_df.loc[homeaudio_df['Week'] >= 28]
homeaudio_df2 = homeaudio_df.loc[homeaudio_df['Week'] < 28]
cameraaccessory_df = pd.concat([cameraaccessory_df1, cameraaccessory_df2])
gamingaccessory_df = pd.concat([gamingaccessory_df1, gamingaccessory_df2])
homeaudio_df = pd.concat([homeaudio_df1, homeaudio_df2])
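The split at week 28 above rotates the rows into July-to-June fiscal order, since the analysis year starts in July 2015 (ISO week 28). A self-contained sketch of that reordering with toy week numbers:

```python
import pandas as pd

df = pd.DataFrame({'Week': [1, 2, 28, 29, 52]})

# Weeks 28-53 (Jul-Dec 2015) should precede weeks 1-27 (Jan-Jun 2016).
fiscal = pd.concat([df[df['Week'] >= 28], df[df['Week'] < 28]], ignore_index=True)
print(fiscal['Week'].tolist())  # [28, 29, 52, 1, 2]
```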
gamingaccessory_df.head()
cameraaccessory_df.fillna(value=0, inplace=True)
gamingaccessory_df.fillna(value=0, inplace=True)
homeaudio_df.fillna(value=0, inplace=True)
cameraaccessory_df.head()
print('Shape of cameraaccessory_df: {}'.format(cameraaccessory_df.shape))
print('Shape of gamingaccessory_df: {}'.format(gamingaccessory_df.shape))
print('Shape of homeaudio_df: {}'.format(homeaudio_df.shape))
main_df.shape
product_sub = main_df.loc[(main_df['product_analytic_sub_category'] == 'cameraaccessory')| \
(main_df['product_analytic_sub_category'] == 'gamingaccessory')| \
(main_df['product_analytic_sub_category'] == 'homeaudio')]
product_sub.shape
# Set the figure size.
plt.figure(figsize=(3, 3), dpi=100, facecolor='w', edgecolor='k', frameon='True')
sns.set_style("white") # white/whitegrid/dark/ticks
sns.set_context("paper") # talk/poster
sns.boxplot(x='product_type', y='Discount%', palette='coolwarm', data=product_sub)
# plot legend
# Automatically adjust subplot params so that the subplots fit into the figure area.
plt.tight_layout()
# display the plot
plt.show()
This is a known trend among luxury products or luxury brands to offer limited or no discounts to retain the exclusivity of their products.
# Slightly alter the figure size to make it more horizontal.
plt.figure(figsize=(10, 4), dpi=100, facecolor='w', edgecolor='k', frameon='True')
sns.set_style("white") # white/whitegrid/dark/ticks
sns.set_context("paper") # talk/poster
# subplot 1
plt.subplot(2, 1, 1)
plt.plot(net_promoter_score.iloc[:,0], net_promoter_score.iloc[:,1], 'g-', linewidth=2)
plt.xlabel('Week', fontsize=10);
plt.ylabel('NPS', fontsize=10);
# subplot 2
plt.subplot(2, 1, 2)
plt.plot(net_promoter_score.iloc[:,0], net_promoter_score.iloc[:,4], 'r-', linewidth=2)
plt.xlabel('Week', fontsize=10);
plt.ylabel('Stock Index', fontsize=10);
# Automatically adjust subplot params so that the subplots fit into the figure area.
plt.tight_layout()
# display the plot
plt.show()
Consumer NPS is highest in weeks 32–35, which coincides with the period when the maximum discounts were being offered.
# Slightly alter the figure size to make it more horizontal.
plt.figure(figsize=(8, 5), dpi=100, facecolor='w', edgecolor='k', frameon='True')
sns.set_style("white") # white/whitegrid/dark/ticks
sns.set_context("paper") # talk/poster
# subplot 1
plt.subplot(1, 1, 1)
plt.plot(media_investment.iloc[:,0], media_investment.iloc[:,1::5], linewidth=3, alpha = 0.7)
plt.xlabel('Week#', fontsize=10)
plt.ylabel('Media Channel Investments', fontsize=10)
plt.legend(media_investment.iloc[:,1::5])
# Automatically adjust subplot params so that the subplots fit into the figure area.
plt.tight_layout()
# display the plot
plt.show()
Over the past year, the bulk of the ad investment went to Sponsorships, followed by Online Marketing and Search Engine Marketing (especially around Thanksgiving).
# Slightly alter the figure size to make it more horizontal.
plt.figure(figsize=(6,4), dpi=100, facecolor='w', edgecolor='k', frameon='True')
sns.set_style("white") # white/whitegrid/dark/ticks
sns.set_context("paper") # talk/poster
sns.barplot(y='gmv', x='product_analytic_sub_category', hue ='occassion_flag', \
palette='husl', data=product_sub, estimator=np.median)
# plot legend
plt.legend(frameon=True, fontsize='small', shadow='True', title='Is a Holiday? 0-No 1-Yes', bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
# Automatically adjust subplot params so that the subplots fit into the figure area.
plt.tight_layout()
# display the plot
plt.show()
product_sub.groupby(["product_type", "product_analytic_sub_category"]).size().unstack().plot(kind='bar', \
stacked=True, figsize=(8,6), \
fontsize = 10)
# plot x axis label
plt.xlabel('Product Type', fontsize = 12)
# plot y axis label
plt.ylabel('No of Items Sold', fontsize = 12)
# plot legend
plt.legend(frameon=True, fontsize='small', shadow='True', title='Product Sub-category', \
bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
# Automatically adjust subplot params so that the subplots fit into the figure area.
plt.tight_layout()
# display the plot
plt.show()
product_sub.groupby(["Month", "product_analytic_sub_category"]).size().unstack().plot(kind='bar', \
stacked=True, figsize=(8,6), \
fontsize = 10)
# plot x axis label
plt.xlabel('Month', fontsize = 12)
# plot y axis label
plt.ylabel('No of Items Sold', fontsize = 12)
# plot legend
plt.legend(frameon=True, fontsize='small', shadow='True', title='Product Sub-category', \
bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
# Automatically adjust subplot params so that the subplots fit into the figure area.
plt.tight_layout()
# display the plot
plt.show()
product_sub.groupby(["Week", "product_analytic_sub_category"]).size().unstack().plot(kind='bar', \
stacked=True, figsize=(15,6), \
fontsize = 10)
# plot x axis label
plt.xlabel('Week', fontsize = 12)
# plot y axis label
plt.ylabel('No of Items Sold', fontsize = 12)
# plot legend
plt.legend(frameon=True, fontsize='small', shadow='True', title='Product Sub-category', \
bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
# Automatically adjust subplot params so that the subplots fit into the figure area.
plt.tight_layout()
# display the plot
plt.show()
The number of items sold peaks in the Thanksgiving week. Overall, October saw the most items sold.
highest_gmv = pd.DataFrame(product_sub.groupby(['product_analytic_sub_category','product_analytic_vertical'])['gmv'].sum().sort_values(ascending=False).reset_index()).head(10)
highest_gmv['product_in_category'] = highest_gmv[['product_analytic_vertical','product_analytic_sub_category']].apply(lambda x: ' in '.join(x), axis=1)
highest_gmv.head(20)
import squarify # pip install squarify (algorithm for treemap)
# Create a dataset:
my_values=list(highest_gmv['gmv'])
plt.figure(figsize=(12,6), dpi=80, facecolor='w', edgecolor='k')
# create a color palette, mapped to these values
cmap = matplotlib.cm.Blues
mini=min(my_values)
maxi=max(my_values)
norm = matplotlib.colors.Normalize(vmin=mini, vmax=maxi)
colors = [cmap(norm(value)) for value in my_values]
# Change color
squarify.plot(sizes=my_values, alpha=.8, label=highest_gmv['product_in_category'],color=colors)
plt.axis('off')
# Show graphic
plt.tight_layout()
plt.show()
homeaudiospeaker in homeaudio brought in the largest revenue, followed by lens in cameraaccessory and gamepad in gamingaccessory.¶most_sales = pd.DataFrame(product_sub.groupby(['product_analytic_sub_category','product_analytic_vertical'])['units'].count().sort_values(ascending=False).reset_index()).head(10)
most_sales['product_in_category'] = most_sales[['product_analytic_vertical','product_analytic_sub_category']].apply(lambda x: ' in '.join(x), axis=1)
most_sales.head(20)
import squarify # pip install squarify (algorithm for treemap)
# Create a dataset:
my_values=list(most_sales['units'])
plt.figure(figsize=(12,6), dpi=80, facecolor='w', edgecolor='k')
# create a color palette, mapped to these values
cmap = matplotlib.cm.Blues
mini=min(my_values)
maxi=max(my_values)
norm = matplotlib.colors.Normalize(vmin=mini, vmax=maxi)
colors = [cmap(norm(value)) for value in my_values]
# Change color
squarify.plot(sizes=my_values, alpha=.8, label=most_sales['product_in_category'],color=colors)
plt.axis('off')
# Show graphic
plt.tight_layout()
plt.show()
homeaudiospeaker in homeaudio recorded the most sales, followed by gamingheadset and gamepad in gamingaccessory.¶cameraaccessory = cameraaccessory_df.copy()
gamingaccessory = gamingaccessory_df.copy()
homeaudio = homeaudio_df.copy()
cameraaccessory['Week'] = cameraaccessory['Week'].apply(str)
gamingaccessory['Week'] = gamingaccessory['Week'].apply(str)
homeaudio['Week'] = homeaudio['Week'].apply(str)
# Slightly alter the figure size to make it more horizontal.
plt.figure(figsize=(10, 8), dpi=100, facecolor='w', edgecolor='k', frameon='True')
sns.set_style("white") # white/whitegrid/dark/ticks
sns.set_context("paper") # talk/poster
# subplot 1
plt.subplot(2, 1, 1)
plt.plot(gamingaccessory.iloc[:,0], gamingaccessory.iloc[:,1], 'b-', linewidth=2, alpha=0.7)
plt.plot(cameraaccessory.iloc[:,0], cameraaccessory.iloc[:,1], 'r-', linewidth=2, alpha=0.7)
plt.plot(homeaudio.iloc[:,0], homeaudio.iloc[:,1], 'g-', linewidth=2, alpha=0.7)
plt.xlabel('Week', fontsize=10);
plt.ylabel('Total GMV', fontsize=10);
plt.legend(['gamingaccessory gmv','cameraaccessory gmv','homeaudio gmv'])
# subplot 2
plt.subplot(2, 1, 2)
plt.plot(gamingaccessory.iloc[:,0], gamingaccessory.iloc[:,2], 'b-', linewidth=2, alpha=0.7)
plt.plot(cameraaccessory.iloc[:,0], cameraaccessory.iloc[:,2], 'r-', linewidth=2, alpha=0.7)
plt.plot(homeaudio.iloc[:,0], homeaudio.iloc[:,2], 'g-', linewidth=2, alpha=0.7)
plt.plot(homeaudio.iloc[:,0], homeaudio.iloc[:,21], 'c-', linewidth=2, alpha=0.7)
plt.xlabel('Week', fontsize=10);
plt.ylabel('Promotion & Ads', fontsize=10);
plt.legend(['gamingaccessory discount','cameraaccessory discount','homeaudio discount','total ad investment'])
# Automatically adjust subplot params so that the subplots fit into the figure area.
plt.tight_layout()
# display the plot
plt.show()
For week 42 (during Thanksgiving), all the graphs show a steep rise. Revenue increased because of both a higher discount% and increased ad investment.
For week 32 (August), revenue was the lowest across all three product sub-categories. This corresponds directly to the minimum total investment in ads. Discounts were also at their lowest for all products apart from camera accessories. After this dip in revenue, the discount% was increased to drive higher sales, most markedly for gaming accessories. However, barring home audio products, revenue remained flat for the next three weeks before starting to pick up.
In general, the average discount% offered on home audio products is lower than that of the other product sub-categories.
# Segmenting the Discount% into various bins
def discount_binning(df, cut_points, label_names):
    column_index = df.columns.get_loc('Discount%') + 1
    df.insert(loc=column_index, column='Discount Bins',
              value=pd.cut(df['Discount%'], cut_points, labels=label_names, include_lowest=True))
    return df
cut_points = [0,10,20,30,40,50,60,70,80,90,100]
label_names = ["Below 10%","Between 10-20%","Between 20-30%","Between 30-40%","Between 40-50%","Between 50-60%", \
"Between 60-70%","Between 70-80%","Between 80-90%","Between 90-100%"]
product_sub = discount_binning(product_sub,cut_points,label_names)
product_sub.head()
# Slightly alter the figure size to make it more horizontal.
plt.figure(figsize=(10,4), dpi=100, facecolor='w', edgecolor='k', frameon='True')
sns.set_style("white") # white/whitegrid/dark/ticks
sns.set_context("paper") # talk/poster
# subplot 1
plt.subplot(1, 2, 1)
sns.barplot(x='gmv', y='Discount Bins', palette='husl', data=product_sub, estimator=np.median)
# subplot 2
plt.subplot(1, 2, 2)
sns.countplot(y='Discount Bins', palette='husl', data=product_sub)
# Automatically adjust subplot params so that the subplots fit into the figure area.
plt.tight_layout()
# display the plot
plt.show()
Median revenue is highest when the average discount% is between 10-20%; beyond that, it slowly declines.
Sales, on the other hand, rise steadily with the discount percentage, peaking at 50-60% before falling again.
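The median-revenue-per-bin figure shown by the barplot (estimator=np.median) can also be reproduced directly with groupby; a minimal sketch on a hypothetical mini-dataset (the column names mirror product_sub, the numbers are made up):

```python
import pandas as pd

# Hypothetical mini-dataset standing in for product_sub (numbers are made up)
df = pd.DataFrame({'Discount%': [5, 15, 25, 55, 65, 15],
                   'gmv':       [100, 400, 300, 250, 150, 500]})
cut_points = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
label_names = ["Below 10%", "Between 10-20%", "Between 20-30%", "Between 30-40%",
               "Between 40-50%", "Between 50-60%", "Between 60-70%",
               "Between 70-80%", "Between 80-90%", "Between 90-100%"]
df['Discount Bins'] = pd.cut(df['Discount%'], cut_points, labels=label_names, include_lowest=True)

# Median gmv per discount bin -- the quantity the barplot visualises
median_gmv = df.groupby('Discount Bins')['gmv'].median()
print(median_gmv["Between 10-20%"])  # 450.0
```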
from collections import Counter
labels, values = zip(*Counter(product_sub["Discount Bins"]).items())
colors = ["#ff9999", "#99d8c9", "#2ca25f", "#8856a7","#43a2ca","#fdbb84","#e34a33","#bdbdbd","#636363","#f7fcb9"]
piechart_df = (pd.DataFrame(list(values),list(labels)))
piechart_df = piechart_df.reset_index()
fig = plt.figure(figsize=[6,6])
plt.pie(piechart_df[0],labels=piechart_df["index"],startangle=180,explode=(0,0,0,0.1,0,0,0,0,0,0),autopct="%1.1f%%", \
shadow=True, colors=colors)
plt.tight_layout()
plt.title("No of Items sold at Different Discount%", fontsize=15)
plt.show()
# Slightly alter the figure size to make it more horizontal.
plt.figure(figsize=(10,4), dpi=100, facecolor='w', edgecolor='k', frameon='True')
sns.set_style("white") # white/whitegrid/dark/ticks
sns.set_context("paper") # talk/poster
# subplot 1
plt.subplot(1, 2, 1)
sns.barplot(x='gmv', y='product_analytic_sub_category', hue ='s1_fact.order_payment_type', \
palette='coolwarm', data=product_sub, estimator=np.median)
# subplot 2
plt.subplot(1, 2, 2)
sns.countplot(y='product_analytic_sub_category', hue ='s1_fact.order_payment_type', palette='coolwarm', data=product_sub)
# plot legend
plt.legend(frameon=True, fontsize='small', shadow=True, title='Payment Type', bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
# Automatically adjust subplot params so that the subplots fit into the figure area.
plt.tight_layout()
# display the plot
plt.show()
product_sub['luxury'] = product_sub['product_type'].apply(lambda x:1 if x=='luxury' else 0)
product_sub['mass_market'] = product_sub['product_type'].apply(lambda x:1 if x=='mass_market' else 0)
product_sub.head()
product_type = pd.DataFrame(product_sub.groupby('product_analytic_sub_category')[['luxury','mass_market']].sum().reset_index())
product_type
# From raw value to percentage
r = [0,1,2]
totals = [i+j for i,j in zip(product_type['luxury'], product_type['mass_market'])]
luxury = [i / j * 100 for i,j in zip(product_type['luxury'], totals)]
mass_market = [i / j * 100 for i,j in zip(product_type['mass_market'], totals)]
names = list(product_type['product_analytic_sub_category'])
# plot
# adjust figure size
plt.figure(figsize=(6,6), dpi=80, facecolor='w', edgecolor='k')
barWidth = 0.85
# Create Luxury Bars
plt.bar(r, luxury, color='#b5ffb9', edgecolor='white', width=barWidth)
# Create mass_market Bars
plt.bar(r, mass_market, bottom=[i for i in luxury], color='#a3acff', edgecolor='white', width=barWidth)
# Custom x axis
plt.xticks(r, names, rotation='vertical')
plt.legend(['luxury','mass_market'], frameon=True, fontsize='small', shadow=True, title='Product Type', bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.title("Percentage of Luxury & Mass-market Product for different Sub-categories")
# Show graphic
plt.tight_layout()
plt.show()
product_sub['prepaid'] = product_sub['s1_fact.order_payment_type'].apply(lambda x:1 if x=='prepaid' else 0)
product_sub['cod'] = product_sub['s1_fact.order_payment_type'].apply(lambda x:1 if x=='cod' else 0)
product_sub.head()
payment_type = pd.DataFrame(product_sub.groupby('product_analytic_sub_category')[['prepaid','cod']].sum().reset_index())
payment_type
# From raw value to percentage
r = [0,1,2]
totals = [i+j for i,j in zip(payment_type['prepaid'], payment_type['cod'])]
prepaid = [i / j * 100 for i,j in zip(payment_type['prepaid'], totals)]
cod = [i / j * 100 for i,j in zip(payment_type['cod'], totals)]
names = list(payment_type['product_analytic_sub_category'])
# plot
# adjust figure size
plt.figure(figsize=(6,6), dpi=80, facecolor='w', edgecolor='k')
barWidth = 0.85
# Create prepaid Bars
plt.bar(r, prepaid, color='#b5ffb9', edgecolor='white', width=barWidth)
# Create cod Bars
plt.bar(r, cod, bottom=[i for i in prepaid], color='#a3acff', edgecolor='white', width=barWidth)
# Custom x axis
plt.xticks(r, names, rotation='vertical')
plt.legend(['prepaid','cod'], frameon=True, fontsize='small', shadow=True, title='Payment Type', bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.title("Percentage of Prepaid & COD Products for different Sub-categories")
# Show graphic
plt.tight_layout()
plt.show()
product_type_with_discount = pd.DataFrame(product_sub.groupby('Discount Bins')[['luxury','mass_market']].sum().reset_index())
product_type_with_discount
# From raw value to percentage
r = [0,1,2,3,4,5,6,7,8,9]
totals = [i+j for i,j in zip(product_type_with_discount['luxury'], product_type_with_discount['mass_market'])]
luxury = [i / j * 100 for i,j in zip(product_type_with_discount['luxury'], totals)]
mass_market = [i / j * 100 for i,j in zip(product_type_with_discount['mass_market'], totals)]
names = list(product_type_with_discount['Discount Bins'])
# plot
# adjust figure size
plt.figure(figsize=(10,6), dpi=80, facecolor='w', edgecolor='k')
barWidth = 0.85
# Create Luxury Bars
plt.bar(r, luxury, color='#b5ffb9', edgecolor='white', width=barWidth)
# Create mass_market Bars
plt.bar(r, mass_market, bottom=[i for i in luxury], color='#a3acff', edgecolor='white', width=barWidth)
# Custom x axis
plt.xticks(r, names, rotation='vertical')
plt.legend(['luxury','mass_market'], frameon=True, fontsize='small', shadow=True, title='Product Type', bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.title("Percentage of Luxury & Mass-market Product under different Discount Group")
# Show graphic
plt.tight_layout()
plt.show()
# Slightly alter the figure size to make it more horizontal.
#plt.figure(figsize=(10,4), dpi=100, facecolor='w', edgecolor='k', frameon='True')
sns.set_style("white") # white/whitegrid/dark/ticks
sns.set_context("paper") # talk/poster
sns.pairplot(cameraaccessory, x_vars=['Total Investment', 'TV', 'Digital', 'Sponsorship', 'Content Marketing', \
'Online marketing', 'Affiliates', 'SEM', 'Radio', 'Other'], \
y_vars='gmv', height=4, aspect=0.5, kind='reg')
plt.title('cameraaccessory', fontsize = 20)
sns.pairplot(gamingaccessory, x_vars=['Total Investment', 'TV', 'Digital', 'Sponsorship', 'Content Marketing', \
'Online marketing', 'Affiliates', 'SEM', 'Radio', 'Other'], \
y_vars='gmv', height=4, aspect=0.5, kind='reg')
plt.title('gamingaccessory', fontsize = 20)
sns.pairplot(homeaudio, x_vars=['Total Investment', 'TV', 'Digital', 'Sponsorship', 'Content Marketing', \
'Online marketing', 'Affiliates', 'SEM', 'Radio', 'Other'], \
y_vars='gmv', height=4, aspect=0.5, kind='reg')
plt.title('homeaudio', fontsize = 20)
# Automatically adjust subplot params so that the subplots fit into the figure area.
plt.tight_layout()
# display the plot
plt.show()
rev_ad_columns = ['gmv','Total Investment', 'Total Investment_SMA_3', 'Total Investment_SMA_5', 'Total Investment_EMA_8', \
'Total_Investment_Ad_Stock', 'TV', 'TV_SMA_3', 'TV_SMA_5', 'TV_EMA_8', 'TV_Ad_Stock', 'Digital', \
'Digital_SMA_3', 'Digital_SMA_5', 'Digital_EMA_8', 'Digital_Ad_Stock', 'Sponsorship', 'Sponsorship_SMA_3', \
'Sponsorship_SMA_5', 'Sponsorship_EMA_8', 'Sponsorship_Ad_Stock', 'Content Marketing', \
'Content Marketing_SMA_3','Content Marketing_SMA_5','Content Marketing_EMA_8', 'Content_Marketing_Ad_Stock', \
'Online marketing', 'Online marketing_SMA_3', 'Online marketing_SMA_5', 'Online marketing_EMA_8', \
'Online_marketing_Ad_Stock', 'Affiliates', 'Affiliates_SMA_3', 'Affiliates_SMA_5', 'Affiliates_EMA_8', \
'Affiliates_Ad_Stock', 'SEM', 'SEM_SMA_3', 'SEM_SMA_5', 'SEM_EMA_8', 'SEM_Ad_Stock', 'Radio', 'Radio_SMA_3', \
'Radio_SMA_5', 'Radio_EMA_8', 'Radio_Ad_Stock', 'Other', 'Other_SMA_3', 'Other_SMA_5', 'Other_EMA_8', \
'Other_Ad_Stock']
cameraaccessory_ad = cameraaccessory[rev_ad_columns]
gamingaccessory_ad = gamingaccessory[rev_ad_columns]
homeaudio_ad = homeaudio[rev_ad_columns]
homeaudio_ad.head()
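The *_Ad_Stock features selected above presumably capture the carry-over (adstock) effect of past ad spend. A minimal sketch of geometric adstock, assuming a decay rate of 0.5 (the rate actually used to build these columns is not shown here):

```python
import numpy as np

def ad_stock(spend, decay=0.5):
    """Geometric adstock: stock[t] = spend[t] + decay * stock[t-1]."""
    stock = np.zeros(len(spend))
    carry = 0.0
    for t, s in enumerate(spend):
        carry = s + decay * carry  # current spend plus decayed past effect
        stock[t] = carry
    return stock

stock = ad_stock([100, 0, 0, 50], decay=0.5)
print(stock.tolist())  # [100.0, 50.0, 25.0, 62.5]
```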
plt.figure(figsize=(20,10), dpi=80, facecolor='w', edgecolor='k', frameon='True')
cam_cor = cameraaccessory_ad.corr()
sns.heatmap(cam_cor, cmap="YlGnBu", annot=True)
plt.show()
plt.figure(figsize=(20,10), dpi=80, facecolor='w', edgecolor='k', frameon='True')
gam_cor = gamingaccessory_ad.corr()
sns.heatmap(gam_cor, cmap="YlGnBu", annot=True)
plt.show()
plt.figure(figsize=(20,10), dpi=80, facecolor='w', edgecolor='k', frameon='True')
ha_cor = homeaudio_ad.corr()
sns.heatmap(ha_cor, cmap="YlGnBu", annot=True)
plt.show()
# making a copy of original dataframes
cameraaccessory_org_df = cameraaccessory_df.copy()
gamingaccessory_org_df = gamingaccessory_df.copy()
homeaudio_org_df = homeaudio_df.copy()
homeaudio_org_df.head()
# Checking for total count and percentage of null values in all columns of the dataframe.
total = pd.DataFrame(gamingaccessory_org_df.isnull().sum().sort_values(ascending=False), columns=['Total'])
percentage = pd.DataFrame(round(100*(gamingaccessory_org_df.isnull().sum()/gamingaccessory_org_df.shape[0]),2).sort_values(ascending=False)\
,columns=['Percentage'])
pd.concat([total, percentage], axis = 1).head()
We will use Standard scaling.
Dropping the Week column, as it is a row identifier and will not help in predicting revenue¶# removing columns
cameraaccessory_df = cameraaccessory_df.drop('Week', axis=1)
gamingaccessory_df = gamingaccessory_df.drop('Week', axis=1)
homeaudio_df = homeaudio_df.drop('Week', axis=1)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
cameraaccessory_df[cameraaccessory_df.columns]=scaler.fit_transform(cameraaccessory_df[cameraaccessory_df.columns])
gamingaccessory_df[gamingaccessory_df.columns]=scaler.fit_transform(gamingaccessory_df[gamingaccessory_df.columns])
homeaudio_df[homeaudio_df.columns]=scaler.fit_transform(homeaudio_df[homeaudio_df.columns])
cameraaccessory_df.head()
The first step for regression is performing a train-test split.
from sklearn.model_selection import train_test_split
# random_state is fixed so that the train and test sets always contain the same rows
cameraaccessory_train, cameraaccessory_test = train_test_split(cameraaccessory_df, \
train_size = 0.7, test_size = 0.3, random_state = 100)
gamingaccessory_train, gamingaccessory_test = train_test_split(gamingaccessory_df, \
train_size = 0.7, test_size = 0.3, random_state = 100)
homeaudio_train, homeaudio_test = train_test_split(homeaudio_df, \
train_size = 0.7, test_size = 0.3, random_state = 100)
y_cameraaccessory_train = cameraaccessory_train.pop('gmv')
X_cameraaccessory_train = cameraaccessory_train
y_gamingaccessory_train = gamingaccessory_train.pop('gmv')
X_gamingaccessory_train = gamingaccessory_train
y_homeaudio_train = homeaudio_train.pop('gmv')
X_homeaudio_train = homeaudio_train
X_cameraaccessory_train.head()
y_cameraaccessory_test = cameraaccessory_test.pop('gmv')
X_cameraaccessory_test = cameraaccessory_test
y_gamingaccessory_test = gamingaccessory_test.pop('gmv')
X_gamingaccessory_test = gamingaccessory_test
y_homeaudio_test = homeaudio_test.pop('gmv')
X_homeaudio_test = homeaudio_test
X_cameraaccessory_test.head()
cameraaccessory¶from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
cameraaccessory_model = LinearRegression().fit(X_cameraaccessory_train, y_cameraaccessory_train)
y_cameraaccessory_test_pred = cameraaccessory_model.predict(X_cameraaccessory_test)
print('R2 Score: {}'.format(r2_score(y_cameraaccessory_test, y_cameraaccessory_test_pred)))
print('Mean Squared Error: {}'.format(mean_squared_error(y_cameraaccessory_test, y_cameraaccessory_test_pred)))
cameraaccessory using K-fold Cross Validation¶We use cross_val_predict with 10-fold cross-validation for our linear regression.
y_cameraaccessory = cameraaccessory_df.pop('gmv')
X_cameraaccessory = cameraaccessory_df
# Make cross validated predictions
from sklearn.model_selection import cross_val_score,cross_val_predict
from sklearn import metrics
cameraaccessory_model_cv = LinearRegression().fit(X_cameraaccessory, y_cameraaccessory)
cameraaccessory_predictions_cv = cross_val_predict(cameraaccessory_model_cv, X_cameraaccessory, y_cameraaccessory, cv=10)
accuracy = metrics.r2_score(y_cameraaccessory, cameraaccessory_predictions_cv)
print("Cross-Predicted Accuracy:", accuracy)
print('Mean Squared Error: {}'.format(mean_squared_error(y_cameraaccessory, cameraaccessory_predictions_cv)))
Here the R2 score is negative, which signifies that the chosen model does not follow the trend of the data: it fits worse than a horizontal line at the mean of the target. In short, the chosen model (with its constraints) fits the data very poorly.
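A tiny synthetic example makes this concrete: a model that always predicts the mean scores exactly R2 = 0, and anything worse scores negative.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Predicting the mean everywhere gives R2 = 0 (the horizontal-line baseline)
r2_mean = r2_score(y_true, np.full(4, y_true.mean()))
print(r2_mean)  # 0.0

# Predictions worse than the mean give a negative R2
r2_bad = r2_score(y_true, np.array([4.0, 3.0, 2.0, 1.0]))
print(r2_bad)   # -3.0
```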
cameraaccessory from model without cv¶# linear regression model parameters
#Limiting floats output to 3 decimal points
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))
pd.set_option('display.precision',1)
cameraaccessory_lr_model_parameters = list(cameraaccessory_model.coef_)
cameraaccessory_lr_model_parameters.insert(0, cameraaccessory_model.intercept_)
cameraaccessory_lr_model_parameters = [round(x, 3) for x in cameraaccessory_lr_model_parameters]
cols = X_cameraaccessory_test.columns
cols = cols.insert(0, "constant")
cameraaccessory_lr_coef = list(zip(cols, cameraaccessory_lr_model_parameters))
cameraaccessory_lr_coef
cameraaccessory_lr_coef_df = pd.DataFrame(cameraaccessory_lr_coef)
col_rename = {0:'Features',1: 'Coefficients'}
cameraaccessory_lr_coef_df = cameraaccessory_lr_coef_df.rename(columns=col_rename)
cameraaccessory_lr_coef_df = cameraaccessory_lr_coef_df.iloc[1:,:]
cameraaccessory_lr_coef_df = cameraaccessory_lr_coef_df.loc[cameraaccessory_lr_coef_df['Coefficients']!=0.0]
cameraaccessory_lr_coef_df = cameraaccessory_lr_coef_df.sort_values(by=['Coefficients'], ascending = False)
cameraaccessory_lr_coef_df
cameraaccessory¶# Slightly alter the figure size to make it more horizontal.
plt.figure(figsize=(10, 15), dpi=100, facecolor='w', edgecolor='k', frameon='True')
sns.barplot(y='Features', x='Coefficients', palette='husl', data=cameraaccessory_lr_coef_df, estimator=np.sum)
# Automatically adjust subplot params so that the subplots fit into the figure area.
plt.tight_layout()
# display the plot
plt.show()
| Features | Coefficients |
|---|---|
| product_vertical_lens | 0.432 |
| product_vertical_camerabattery | 0.200 |
| product_vertical_camerabag | 0.182 |
| product_vertical_camerahousing | 0.171 |
| Online marketing | 0.151 |
gamingaccessory¶from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
gamingaccessory_model = LinearRegression().fit(X_gamingaccessory_train, y_gamingaccessory_train)
y_gamingaccessory_test_pred = gamingaccessory_model.predict(X_gamingaccessory_test)
print('R2 Score: {}'.format(r2_score(y_gamingaccessory_test, y_gamingaccessory_test_pred)))
print('Mean Squared Error: {}'.format(mean_squared_error(y_gamingaccessory_test, y_gamingaccessory_test_pred)))
gamingaccessory using K-fold Cross Validation¶We use cross_val_predict with 10-fold cross-validation for our linear regression.
y_gamingaccessory = gamingaccessory_df.pop('gmv')
X_gamingaccessory = gamingaccessory_df
# Make cross validated predictions
from sklearn.model_selection import cross_val_score,cross_val_predict
from sklearn import metrics
gamingaccessory_model_cv = LinearRegression().fit(X_gamingaccessory, y_gamingaccessory)
gamingaccessory_predictions_cv = cross_val_predict(gamingaccessory_model_cv, X_gamingaccessory, y_gamingaccessory, cv=10)
accuracy = metrics.r2_score(y_gamingaccessory, gamingaccessory_predictions_cv)
print("Cross-Predicted Accuracy:", accuracy)
print('Mean Squared Error: {}'.format(mean_squared_error(y_gamingaccessory, gamingaccessory_predictions_cv)))
gamingaccessory from model without cv¶# linear regression model parameters
#Limiting floats output to 3 decimal points
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))
pd.set_option('display.precision',1)
gamingaccessory_lr_model_parameters = list(gamingaccessory_model.coef_)
gamingaccessory_lr_model_parameters.insert(0, gamingaccessory_model.intercept_)
gamingaccessory_lr_model_parameters = [round(x, 3) for x in gamingaccessory_lr_model_parameters]
cols = X_gamingaccessory_test.columns
cols = cols.insert(0, "constant")
gamingaccessory_lr_coef = list(zip(cols, gamingaccessory_lr_model_parameters))
gamingaccessory_lr_coef
gamingaccessory_lr_coef_df = pd.DataFrame(gamingaccessory_lr_coef)
col_rename = {0:'Features',1: 'Coefficients'}
gamingaccessory_lr_coef_df = gamingaccessory_lr_coef_df.rename(columns=col_rename)
gamingaccessory_lr_coef_df = gamingaccessory_lr_coef_df.iloc[1:,:]
gamingaccessory_lr_coef_df = gamingaccessory_lr_coef_df.loc[gamingaccessory_lr_coef_df['Coefficients']!=0.0]
gamingaccessory_lr_coef_df = gamingaccessory_lr_coef_df.sort_values(by=['Coefficients'], ascending = False)
gamingaccessory_lr_coef_df
gamingaccessory¶# Slightly alter the figure size to make it more horizontal.
plt.figure(figsize=(10, 15), dpi=100, facecolor='w', edgecolor='k', frameon='True')
sns.barplot(y='Features', x='Coefficients', palette='husl', data=gamingaccessory_lr_coef_df, estimator=np.sum)
# Automatically adjust subplot params so that the subplots fit into the figure area.
plt.tight_layout()
# display the plot
plt.show()
| Features | Coefficients |
|---|---|
| product_vertical_gamepad | 0.201 |
| product_vertical_gamingheadset | 0.183 |
| is_mass_market | 0.167 |
| product_vertical_gamingaccessorykit | 0.126 |
| product_vertical_gamingmouse | 0.107 |
homeaudio¶from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
homeaudio_model = LinearRegression().fit(X_homeaudio_train, y_homeaudio_train)
y_homeaudio_test_pred = homeaudio_model.predict(X_homeaudio_test)
print('R2 Score: {}'.format(r2_score(y_homeaudio_test, y_homeaudio_test_pred)))
print('Mean Squared Error: {}'.format(mean_squared_error(y_homeaudio_test, y_homeaudio_test_pred)))
homeaudio using K-fold Cross Validation¶We use cross_val_predict with 5-fold cross-validation for our linear regression.
y_homeaudio = homeaudio_df.pop('gmv')
X_homeaudio = homeaudio_df
# Make cross validated predictions
from sklearn.model_selection import cross_val_score,cross_val_predict
from sklearn import metrics
homeaudio_model_cv = LinearRegression().fit(X_homeaudio, y_homeaudio)
homeaudio_predictions_cv = cross_val_predict(homeaudio_model_cv, X_homeaudio, y_homeaudio, cv=5)
accuracy = metrics.r2_score(y_homeaudio, homeaudio_predictions_cv)
print("Cross-Predicted Accuracy:", accuracy)
print('Mean Squared Error: {}'.format(mean_squared_error(y_homeaudio, homeaudio_predictions_cv)))
homeaudio from model without cv¶# linear regression model parameters
#Limiting floats output to 3 decimal points
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))
pd.set_option('display.precision',1)
homeaudio_lr_model_parameters = list(homeaudio_model.coef_)
homeaudio_lr_model_parameters.insert(0, homeaudio_model.intercept_)
homeaudio_lr_model_parameters = [round(x, 3) for x in homeaudio_lr_model_parameters]
cols = X_homeaudio_test.columns
cols = cols.insert(0, "constant")
homeaudio_lr_coef = list(zip(cols, homeaudio_lr_model_parameters))
homeaudio_lr_coef
homeaudio_lr_coef_df = pd.DataFrame(homeaudio_lr_coef)
col_rename = {0:'Features',1: 'Coefficients'}
homeaudio_lr_coef_df = homeaudio_lr_coef_df.rename(columns=col_rename)
homeaudio_lr_coef_df = homeaudio_lr_coef_df.iloc[1:,:]
homeaudio_lr_coef_df = homeaudio_lr_coef_df.loc[homeaudio_lr_coef_df['Coefficients']!=0.0]
homeaudio_lr_coef_df = homeaudio_lr_coef_df.sort_values(by=['Coefficients'], ascending = False)
homeaudio_lr_coef_df
homeaudio¶# Slightly alter the figure size to make it more horizontal.
plt.figure(figsize=(10, 15), dpi=100, facecolor='w', edgecolor='k', frameon='True')
sns.barplot(y='Features', x='Coefficients', palette='husl', data=homeaudio_lr_coef_df, estimator=np.sum)
# Automatically adjust subplot params so that the subplots fit into the figure area.
plt.tight_layout()
# display the plot
plt.show()
| Features | Coefficients |
|---|---|
| product_vertical_homeaudiospeaker | 0.392 |
| is_mass_market | 0.148 |
| Digital_SMA_3 | 0.139 |
| product_vertical_fmradio | 0.133 |
| is_cod | 0.131 |
The linear regression model we built earlier is an additive model: it carries the implicit assumption that the different KPIs affect revenue additively.
Y = α + β1·X1 + β2·X2 + β3·X3 + β4·X4 + β5·X5 + ϵ
However, when there are interactions between the KPIs, we go for a multiplicative model.
To fit a multiplicative model, take logarithms of the data (on both sides of the model), then analyse the log data as before. Note that the error term must also enter multiplicatively (as e^ϵ') for the log transform to produce the additive form below.
Y = e^α · X1^β1 · X2^β2 · X3^β3 · X4^β4 · X5^β5 · e^ϵ'
ln(Y) = α + β1·ln(X1) + β2·ln(X2) + β3·ln(X3) + β4·ln(X4) + β5·ln(X5) + ϵ'
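As a sanity check that regressing on logs recovers the multiplicative parameters, a small synthetic sketch (the elasticity 0.7 and scale 3.0 are made-up values, not estimates from this data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(1.0, 10.0, size=(200, 1))  # stand-in for a weekly marketing KPI
y = 3.0 * X[:, 0] ** 0.7                   # multiplicative relation: Y = 3 * X^0.7

# Fit the multiplicative model by regressing log(y) on log(X)
model = LinearRegression().fit(np.log(X), np.log(y))
print(model.coef_[0])            # ~0.7 -> recovered elasticity beta
print(np.exp(model.intercept_))  # ~3.0 -> recovered e^alpha
```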
homeaudio_org_df.head()
# Making copy of dataframes from the original ones
cameraaccessory_mul_df = cameraaccessory_org_df.copy()
gamingaccessory_mul_df = gamingaccessory_org_df.copy()
homeaudio_mul_df = homeaudio_org_df.copy()
homeaudio_mul_df.head()
# Checking for total count and percentage of null values in all columns of the dataframe.
total = pd.DataFrame(homeaudio_mul_df.isnull().sum().sort_values(ascending=False), columns=['Total'])
percentage = pd.DataFrame(round(100*(homeaudio_mul_df.isnull().sum()/homeaudio_mul_df.shape[0]),2).sort_values(ascending=False)\
,columns=['Percentage'])
pd.concat([total, percentage], axis = 1).head()
Dropping the Week column, as it is a row identifier and will not help in predicting revenue¶# removing columns
cameraaccessory_mul_df = cameraaccessory_mul_df.drop('Week', axis=1)
gamingaccessory_mul_df = gamingaccessory_mul_df.drop('Week', axis=1)
homeaudio_mul_df = homeaudio_mul_df.drop('Week', axis=1)
homeaudio_mul_df.head()
# Checking for total count and percentage of null values in all columns of the dataframe.
total = pd.DataFrame(homeaudio_mul_df.isnull().sum().sort_values(ascending=False), columns=['Total'])
percentage = pd.DataFrame(round(100*(homeaudio_mul_df.isnull().sum()/homeaudio_mul_df.shape[0]),2).sort_values(ascending=False)\
,columns=['Percentage'])
pd.concat([total, percentage], axis = 1).head()
cameraaccessory_mul_df = cameraaccessory_mul_df.applymap(lambda x: np.log(x))
cameraaccessory_mul_df = cameraaccessory_mul_df.replace([np.inf, -np.inf], 0)
cameraaccessory_mul_df = cameraaccessory_mul_df.replace(np.nan, 0)
gamingaccessory_mul_df = gamingaccessory_mul_df.applymap(lambda x: np.log(x))
gamingaccessory_mul_df = gamingaccessory_mul_df.replace([np.inf, -np.inf], 0)
gamingaccessory_mul_df = gamingaccessory_mul_df.replace(np.nan, 0)
homeaudio_mul_df = homeaudio_mul_df.applymap(lambda x: np.log(x))
homeaudio_mul_df = homeaudio_mul_df.replace([np.inf, -np.inf], 0)
homeaudio_mul_df = homeaudio_mul_df.replace(np.nan, 0)
homeaudio_mul_df.head()
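The replace([np.inf, -np.inf], 0) step above is needed because np.log maps zero values (e.g. weeks with no spend or no revenue) to -inf; a minimal illustration with a hypothetical spend column. An alternative worth considering is np.log1p, which maps 0 to 0 without producing infinities.

```python
import numpy as np
import pandas as pd

# Hypothetical spend column containing a zero-spend week
df = pd.DataFrame({'spend': [0.0, 10.0, 100.0]})
with np.errstate(divide='ignore'):
    logged = np.log(df)  # log(0) -> -inf
cleaned = logged.replace([np.inf, -np.inf], 0)
print(cleaned['spend'].tolist())
```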
# Checking for total count and percentage of null values in all columns of the dataframe.
total = pd.DataFrame(homeaudio_mul_df.isnull().sum().sort_values(ascending=False), columns=['Total'])
percentage = pd.DataFrame(round(100*(homeaudio_mul_df.isnull().sum()/homeaudio_mul_df.shape[0]),2).sort_values(ascending=False)\
,columns=['Percentage'])
pd.concat([total, percentage], axis = 1).head()
We will use Standard scaling.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
cameraaccessory_mul_df[cameraaccessory_mul_df.columns]=scaler.fit_transform(cameraaccessory_mul_df[cameraaccessory_mul_df.columns])
gamingaccessory_mul_df[gamingaccessory_mul_df.columns]=scaler.fit_transform(gamingaccessory_mul_df[gamingaccessory_mul_df.columns])
homeaudio_mul_df[homeaudio_mul_df.columns]=scaler.fit_transform(homeaudio_mul_df[homeaudio_mul_df.columns])
homeaudio_mul_df.head()
The first step for regression is performing a train-test split.
from sklearn.model_selection import train_test_split
# random_state is fixed so that the train and test sets always contain the same rows
cameraaccessory_mul_train, cameraaccessory_mul_test = train_test_split(cameraaccessory_mul_df, \
train_size = 0.7, test_size = 0.3, random_state = 100)
gamingaccessory_mul_train, gamingaccessory_mul_test = train_test_split(gamingaccessory_mul_df, \
train_size = 0.7, test_size = 0.3, random_state = 100)
homeaudio_mul_train, homeaudio_mul_test = train_test_split(homeaudio_mul_df, \
train_size = 0.7, test_size = 0.3, random_state = 100)
y_cameraaccessory_mul_train = cameraaccessory_mul_train.pop('gmv')
X_cameraaccessory_mul_train = cameraaccessory_mul_train
y_gamingaccessory_mul_train = gamingaccessory_mul_train.pop('gmv')
X_gamingaccessory_mul_train = gamingaccessory_mul_train
y_homeaudio_mul_train = homeaudio_mul_train.pop('gmv')
X_homeaudio_mul_train = homeaudio_mul_train
X_homeaudio_mul_train.head()
y_cameraaccessory_mul_test = cameraaccessory_mul_test.pop('gmv')
X_cameraaccessory_mul_test = cameraaccessory_mul_test
y_gamingaccessory_mul_test = gamingaccessory_mul_test.pop('gmv')
X_gamingaccessory_mul_test = gamingaccessory_mul_test
y_homeaudio_mul_test = homeaudio_mul_test.pop('gmv')
X_homeaudio_mul_test = homeaudio_mul_test
X_homeaudio_mul_test.head()
cameraaccessory¶from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
cameraaccessory_mul_model = LinearRegression().fit(X_cameraaccessory_mul_train, y_cameraaccessory_mul_train)
y_cameraaccessory_mul_test_pred = cameraaccessory_mul_model.predict(X_cameraaccessory_mul_test)
print('R2 Score: {}'.format(r2_score(y_cameraaccessory_mul_test, y_cameraaccessory_mul_test_pred)))
print('Mean Squared Error: {}'.format(mean_squared_error(y_cameraaccessory_mul_test, y_cameraaccessory_mul_test_pred)))
cameraaccessory using K-fold Cross Validation¶We use cross_val_predict with 10-fold cross-validation for our linear regression.
y_cameraaccessory_mul = cameraaccessory_mul_df.pop('gmv')
X_cameraaccessory_mul = cameraaccessory_mul_df
# Make cross validated predictions
from sklearn.model_selection import cross_val_score,cross_val_predict
from sklearn import metrics
cameraaccessory_mul_model_cv = LinearRegression().fit(X_cameraaccessory_mul, y_cameraaccessory_mul)
cameraaccessory_mul_predictions_cv = cross_val_predict(cameraaccessory_mul_model_cv, X_cameraaccessory_mul, \
y_cameraaccessory_mul, cv=10)
accuracy = metrics.r2_score(y_cameraaccessory_mul, cameraaccessory_mul_predictions_cv)
print("Cross-Predicted Accuracy:", accuracy)
print('Mean Squared Error: {}'.format(mean_squared_error(y_cameraaccessory_mul, cameraaccessory_mul_predictions_cv)))
cameraaccessory coefficients from model with cv¶# linear regression model parameters
#Limiting floats output to 3 decimal points
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))
pd.set_option('display.precision',1)
cameraaccessory_mul_lr_model_parameters = list(cameraaccessory_mul_model_cv.coef_)
cameraaccessory_mul_lr_model_parameters.insert(0, cameraaccessory_mul_model_cv.intercept_)
cameraaccessory_mul_lr_model_parameters = [round(x, 3) for x in cameraaccessory_mul_lr_model_parameters]
cols = X_cameraaccessory_mul_test.columns
cols = cols.insert(0, "constant")
cameraaccessory_mul_lr_coef = list(zip(cols, cameraaccessory_mul_lr_model_parameters))
cameraaccessory_mul_lr_coef
cameraaccessory_mul_lr_coef_df = pd.DataFrame(cameraaccessory_mul_lr_coef)
col_rename = {0:'Features',1: 'Coefficients'}
cameraaccessory_mul_lr_coef_df = cameraaccessory_mul_lr_coef_df.rename(columns=col_rename)
cameraaccessory_mul_lr_coef_df = cameraaccessory_mul_lr_coef_df.iloc[1:,:]
cameraaccessory_mul_lr_coef_df = cameraaccessory_mul_lr_coef_df.loc[cameraaccessory_mul_lr_coef_df['Coefficients']!=0.0]
cameraaccessory_mul_lr_coef_df = cameraaccessory_mul_lr_coef_df.sort_values(by=['Coefficients'], ascending = False)
cameraaccessory_mul_lr_coef_df
cameraaccessory¶# Make the figure taller so all feature labels fit.
plt.figure(figsize=(10, 15), dpi=100, facecolor='w', edgecolor='k', frameon=True)
sns.barplot(y='Features', x='Coefficients', palette='husl', data=cameraaccessory_mul_lr_coef_df, estimator=np.sum)
# Automatically adjust subplot params so that the subplots fit into the figure area.
plt.tight_layout()
# display the plot
plt.show()
| Features | Coefficients |
|---|---|
| product_vertical_lens | 0.181 |
| product_vertical_camerabattery | 0.160 |
| is_mass_market | 0.149 |
| product_vertical_camerabatterycharger | 0.121 |
| TV | 0.105 |
gamingaccessory¶from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
gamingaccessory_mul_model = LinearRegression().fit(X_gamingaccessory_mul_train, y_gamingaccessory_mul_train)
y_gamingaccessory_mul_test_pred = gamingaccessory_mul_model.predict(X_gamingaccessory_mul_test)
print('R2 Score: {}'.format(r2_score(y_gamingaccessory_mul_test, y_gamingaccessory_mul_test_pred)))
print('Mean Squared Error: {}'.format(mean_squared_error(y_gamingaccessory_mul_test, y_gamingaccessory_mul_test_pred)))
gamingaccessory using K-fold Cross Validation¶We will use cross_val_predict with 10-fold cross validation for our linear regression.
y_gamingaccessory_mul = gamingaccessory_mul_df.pop('gmv')
X_gamingaccessory_mul = gamingaccessory_mul_df
# Make cross validated predictions
from sklearn.model_selection import cross_val_score,cross_val_predict
from sklearn import metrics
gamingaccessory_mul_model_cv = LinearRegression().fit(X_gamingaccessory_mul, y_gamingaccessory_mul)
gamingaccessory_mul_predictions_cv = cross_val_predict(gamingaccessory_mul_model_cv, X_gamingaccessory_mul, \
y_gamingaccessory_mul, cv=10)
accuracy = metrics.r2_score(y_gamingaccessory_mul, gamingaccessory_mul_predictions_cv)
print("Cross-Predicted R2 Score:", accuracy)
print('Mean Squared Error: {}'.format(mean_squared_error(y_gamingaccessory_mul, gamingaccessory_mul_predictions_cv)))
gamingaccessory coefficients from model with cv¶# linear regression model parameters
#Limiting floats output to 3 decimal points
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))
pd.set_option('display.precision',1)
gamingaccessory_mul_lr_model_parameters = list(gamingaccessory_mul_model_cv.coef_)
gamingaccessory_mul_lr_model_parameters.insert(0, gamingaccessory_mul_model_cv.intercept_)
gamingaccessory_mul_lr_model_parameters = [round(x, 3) for x in gamingaccessory_mul_lr_model_parameters]
cols = X_gamingaccessory_mul_test.columns
cols = cols.insert(0, "constant")
gamingaccessory_mul_lr_coef = list(zip(cols, gamingaccessory_mul_lr_model_parameters))
gamingaccessory_mul_lr_coef
gamingaccessory_mul_lr_coef_df = pd.DataFrame(gamingaccessory_mul_lr_coef)
col_rename = {0:'Features',1: 'Coefficients'}
gamingaccessory_mul_lr_coef_df = gamingaccessory_mul_lr_coef_df.rename(columns=col_rename)
gamingaccessory_mul_lr_coef_df = gamingaccessory_mul_lr_coef_df.iloc[1:,:]
gamingaccessory_mul_lr_coef_df = gamingaccessory_mul_lr_coef_df.loc[gamingaccessory_mul_lr_coef_df['Coefficients']!=0.0]
gamingaccessory_mul_lr_coef_df = gamingaccessory_mul_lr_coef_df.sort_values(by=['Coefficients'], ascending = False)
gamingaccessory_mul_lr_coef_df
gamingaccessory¶# Make the figure taller so all feature labels fit.
plt.figure(figsize=(10, 15), dpi=100, facecolor='w', edgecolor='k', frameon=True)
sns.barplot(y='Features', x='Coefficients', palette='husl', data=gamingaccessory_mul_lr_coef_df, estimator=np.sum)
# Automatically adjust subplot params so that the subplots fit into the figure area.
plt.tight_layout()
# display the plot
plt.show()
| Features | Coefficients |
|---|---|
| product_vertical_gamingheadset | 0.250 |
| is_mass_market | 0.234 |
| product_vertical_gamingmouse | 0.224 |
| product_vertical_gamepad | 0.211 |
| Online marketing_SMA_3 | 0.157 |
homeaudio¶from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
homeaudio_mul_model = LinearRegression().fit(X_homeaudio_mul_train, y_homeaudio_mul_train)
y_homeaudio_mul_test_pred = homeaudio_mul_model.predict(X_homeaudio_mul_test)
print('R2 Score: {}'.format(r2_score(y_homeaudio_mul_test, y_homeaudio_mul_test_pred)))
print('Mean Squared Error: {}'.format(mean_squared_error(y_homeaudio_mul_test, y_homeaudio_mul_test_pred)))
Here R2 is negative, which signifies that the chosen model does not follow the trend of the data and so fits worse than a horizontal line (i.e. worse than always predicting the mean). It simply means the chosen model, with its constraints, fits this data very poorly.
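To see how R2 goes negative, note that r2_score benchmarks predictions against a baseline that always predicts the mean of the observed values; any model with larger squared errors than that baseline scores below zero. A minimal illustration (toy numbers, unrelated to our data):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Always predicting the mean gives exactly R2 = 0 (the baseline).
baseline_pred = np.full_like(y_true, y_true.mean())
print(r2_score(y_true, baseline_pred))  # 0.0

# Predictions worse than the mean baseline give a negative R2.
bad_pred = np.array([4.0, 3.0, 2.0, 1.0])
print(r2_score(y_true, bad_pred))  # -3.0
```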
homeaudio using K-fold Cross Validation¶We will use cross_val_predict with 5-fold cross validation for our linear regression.
y_homeaudio_mul = homeaudio_mul_df.pop('gmv')
X_homeaudio_mul = homeaudio_mul_df
# Make cross validated predictions
from sklearn.model_selection import cross_val_score,cross_val_predict
from sklearn import metrics
homeaudio_mul_model_cv = LinearRegression().fit(X_homeaudio_mul, y_homeaudio_mul)
homeaudio_mul_predictions_cv = cross_val_predict(homeaudio_mul_model_cv, X_homeaudio_mul, y_homeaudio_mul, cv=5)
accuracy = metrics.r2_score(y_homeaudio_mul, homeaudio_mul_predictions_cv)
print("Cross-Predicted R2 Score:", accuracy)
print('Mean Squared Error: {}'.format(mean_squared_error(y_homeaudio_mul, homeaudio_mul_predictions_cv)))
homeaudio coefficients from model with cv¶# linear regression model parameters
#Limiting floats output to 3 decimal points
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))
pd.set_option('display.precision',1)
homeaudio_mul_lr_model_parameters = list(homeaudio_mul_model_cv.coef_)
homeaudio_mul_lr_model_parameters.insert(0, homeaudio_mul_model_cv.intercept_)
homeaudio_mul_lr_model_parameters = [round(x, 3) for x in homeaudio_mul_lr_model_parameters]
cols = X_homeaudio_mul_test.columns
cols = cols.insert(0, "constant")
homeaudio_mul_lr_coef = list(zip(cols, homeaudio_mul_lr_model_parameters))
homeaudio_mul_lr_coef
homeaudio_mul_lr_coef_df = pd.DataFrame(homeaudio_mul_lr_coef)
col_rename = {0:'Features',1: 'Coefficients'}
homeaudio_mul_lr_coef_df = homeaudio_mul_lr_coef_df.rename(columns=col_rename)
homeaudio_mul_lr_coef_df = homeaudio_mul_lr_coef_df.iloc[1:,:]
homeaudio_mul_lr_coef_df = homeaudio_mul_lr_coef_df.loc[homeaudio_mul_lr_coef_df['Coefficients']!=0.0]
homeaudio_mul_lr_coef_df = homeaudio_mul_lr_coef_df.sort_values(by=['Coefficients'], ascending = False)
homeaudio_mul_lr_coef_df
homeaudio¶# Make the figure taller so all feature labels fit.
plt.figure(figsize=(10, 15), dpi=100, facecolor='w', edgecolor='k', frameon=True)
sns.barplot(y='Features', x='Coefficients', palette='husl', data=homeaudio_mul_lr_coef_df, estimator=np.sum)
# Automatically adjust subplot params so that the subplots fit into the figure area.
plt.tight_layout()
# display the plot
plt.show()
| Features | Coefficients |
|---|---|
| product_vertical_homeaudiospeaker | 0.469 |
| is_mass_market | 0.289 |
| product_vertical_fmradio | 0.224 |
| Radio_Ad_Stock | 0.147 |
| Sponsorship | 0.121 |
The Additive and Multiplicative Linear Models that we have built till now can capture the current effect of the KPIs. However, if we want to capture the carry-over effect, we would want to model the current revenue figures based on the past figures of the KPIs.
The Koyck model tells us that the current revenue is influenced not just by the different independent attributes, but also by the revenue generated in past periods, i.e. current revenue (Yt) also depends on the past revenue value (Yt-1).
Yt = α + β1X1 + β2X2 + β3X3 + β4X4 + β5X5 + ϵ
Yt = α + µYt-1 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5 + ϵ -- (sale at time t is dependent on sale at time t-1)
If X1 is the advertising effect, β1 is the current effect of advertising, and the carry-over effect of advertising is β1µ/(1-µ).
Therefore the total effect of advertising = Current effect + Carry over effect
= β1 + β1 * µ/(1-µ)
= β1/(1-µ)
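The closed form above can be checked numerically: the total effect is the geometric sum β1(1 + µ + µ² + ...), which converges to β1/(1-µ) when |µ| < 1. A quick sketch with illustrative values (β1 = 0.2 and µ = 0.5 are made up for the example, not model estimates):

```python
# Total advertising effect under the Koyck model:
# beta1 * (1 + mu + mu**2 + ...) = beta1 / (1 - mu), for |mu| < 1
beta1, mu = 0.2, 0.5  # illustrative values only

truncated_sum = sum(beta1 * mu**k for k in range(100))
closed_form = beta1 / (1 - mu)

print(truncated_sum, closed_form)  # both ≈ 0.4
```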
homeaudio_org_df.head()
# Making copy of dataframes from the original ones
cameraaccessory_koy_df = cameraaccessory_org_df.copy()
gamingaccessory_koy_df = gamingaccessory_org_df.copy()
homeaudio_koy_df = homeaudio_org_df.copy()
homeaudio_koy_df.head()
# Checking for total count and percentage of null values in all columns of the dataframe.
total = pd.DataFrame(homeaudio_koy_df.isnull().sum().sort_values(ascending=False), columns=['Total'])
percentage = pd.DataFrame(round(100*(homeaudio_koy_df.isnull().sum()/homeaudio_koy_df.shape[0]),2).sort_values(ascending=False)\
,columns=['Percentage'])
pd.concat([total, percentage], axis = 1).head()
GMV_Lag = ['gmv']
def lag_variables(df,var,n):
for i in var:
loc_index = df.columns.get_loc(i) + 1
df.insert(loc=loc_index,column= i+'_lag'+np.str(n),value=df[i].shift(n))
return df
cameraaccessory_koy_df = lag_variables(cameraaccessory_koy_df,GMV_Lag,1)
gamingaccessory_koy_df = lag_variables(gamingaccessory_koy_df,GMV_Lag,1)
homeaudio_koy_df = lag_variables(homeaudio_koy_df,GMV_Lag,1)
homeaudio_koy_df.head()
# Imputing all null values with 0
cameraaccessory_koy_df.fillna(value=0, inplace=True)
gamingaccessory_koy_df.fillna(value=0, inplace=True)
homeaudio_koy_df.fillna(value=0, inplace=True)
homeaudio_koy_df.head(10)
Dropping the Week column as it is a row identifier and will not help in the prediction of revenue¶# removing columns
cameraaccessory_koy_df = cameraaccessory_koy_df.drop('Week', axis=1)
gamingaccessory_koy_df = gamingaccessory_koy_df.drop('Week', axis=1)
homeaudio_koy_df = homeaudio_koy_df.drop('Week', axis=1)
homeaudio_koy_df.head()
# Checking for total count and percentage of null values in all columns of the dataframe.
total = pd.DataFrame(homeaudio_koy_df.isnull().sum().sort_values(ascending=False), columns=['Total'])
percentage = pd.DataFrame(round(100*(homeaudio_koy_df.isnull().sum()/homeaudio_koy_df.shape[0]),2).sort_values(ascending=False)\
,columns=['Percentage'])
pd.concat([total, percentage], axis = 1).head()
We will use Standard scaling.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
cameraaccessory_koy_df[cameraaccessory_koy_df.columns]=scaler.fit_transform(cameraaccessory_koy_df[cameraaccessory_koy_df.columns])
gamingaccessory_koy_df[gamingaccessory_koy_df.columns]=scaler.fit_transform(gamingaccessory_koy_df[gamingaccessory_koy_df.columns])
homeaudio_koy_df[homeaudio_koy_df.columns]=scaler.fit_transform(homeaudio_koy_df[homeaudio_koy_df.columns])
homeaudio_koy_df.head()
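Note that fitting the scaler on the full dataframe before the train-test split lets test-set statistics leak into the training features. A leakage-free variant (a sketch, with a toy dataframe standing in for our sub-category data) fits the scaler on the training split only:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy dataframe standing in for one of the sub-category dataframes.
df = pd.DataFrame({'gmv': range(10), 'TV': range(10, 20)})
train, test = train_test_split(df, train_size=0.7, random_state=100)
train, test = train.copy(), test.copy()

scaler = StandardScaler()
# Fit on the training split only, then apply the same transform to test.
train[train.columns] = scaler.fit_transform(train[train.columns])
test[test.columns] = scaler.transform(test[test.columns])
```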
As you know, the first basic step for regression is performing a train-test split.
from sklearn.model_selection import train_test_split
# We specify this so that the train and test data set always have the same rows, respectively
cameraaccessory_train, cameraaccessory_test = train_test_split(cameraaccessory_koy_df, \
train_size = 0.7, test_size = 0.3, random_state = 100)
gamingaccessory_train, gamingaccessory_test = train_test_split(gamingaccessory_koy_df, \
train_size = 0.7, test_size = 0.3, random_state = 100)
homeaudio_train, homeaudio_test = train_test_split(homeaudio_koy_df, \
train_size = 0.7, test_size = 0.3, random_state = 100)
y_cameraaccessory_train = cameraaccessory_train.pop('gmv')
X_cameraaccessory_train = cameraaccessory_train
y_gamingaccessory_train = gamingaccessory_train.pop('gmv')
X_gamingaccessory_train = gamingaccessory_train
y_homeaudio_train = homeaudio_train.pop('gmv')
X_homeaudio_train = homeaudio_train
X_homeaudio_train.head()
y_cameraaccessory_test = cameraaccessory_test.pop('gmv')
X_cameraaccessory_test = cameraaccessory_test
y_gamingaccessory_test = gamingaccessory_test.pop('gmv')
X_gamingaccessory_test = gamingaccessory_test
y_homeaudio_test = homeaudio_test.pop('gmv')
X_homeaudio_test = homeaudio_test
X_homeaudio_test.head()
cameraaccessory¶from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
cameraaccessory_model = LinearRegression().fit(X_cameraaccessory_train, y_cameraaccessory_train)
y_cameraaccessory_test_pred = cameraaccessory_model.predict(X_cameraaccessory_test)
print('R2 Score: {}'.format(r2_score(y_cameraaccessory_test, y_cameraaccessory_test_pred)))
print('Mean Squared Error: {}'.format(mean_squared_error(y_cameraaccessory_test, y_cameraaccessory_test_pred)))
cameraaccessory using K-fold Cross Validation¶We will use cross_val_predict with 10-fold cross validation for our linear regression.
y_cameraaccessory = cameraaccessory_koy_df.pop('gmv')
X_cameraaccessory = cameraaccessory_koy_df
# Make cross validated predictions
from sklearn.model_selection import cross_val_score,cross_val_predict
from sklearn import metrics
cameraaccessory_model_cv = LinearRegression().fit(X_cameraaccessory, y_cameraaccessory)
cameraaccessory_predictions_cv = cross_val_predict(cameraaccessory_model_cv, X_cameraaccessory, y_cameraaccessory, cv=10)
accuracy = metrics.r2_score(y_cameraaccessory, cameraaccessory_predictions_cv)
print("Cross-Predicted R2 Score:", accuracy)
print('Mean Squared Error: {}'.format(mean_squared_error(y_cameraaccessory, cameraaccessory_predictions_cv)))
cameraaccessory coefficients from model without cv¶# linear regression model parameters
#Limiting floats output to 3 decimal points
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))
pd.set_option('display.precision',1)
cameraaccessory_lr_model_parameters = list(cameraaccessory_model.coef_)
cameraaccessory_lr_model_parameters.insert(0, cameraaccessory_model.intercept_)
cameraaccessory_lr_model_parameters = [round(x, 3) for x in cameraaccessory_lr_model_parameters]
cols = X_cameraaccessory_test.columns
cols = cols.insert(0, "constant")
cameraaccessory_lr_coef = list(zip(cols, cameraaccessory_lr_model_parameters))
cameraaccessory_lr_coef
cameraaccessory_lr_coef_df = pd.DataFrame(cameraaccessory_lr_coef)
col_rename = {0:'Features',1: 'Coefficients'}
cameraaccessory_lr_coef_df = cameraaccessory_lr_coef_df.rename(columns=col_rename)
cameraaccessory_lr_coef_df = cameraaccessory_lr_coef_df.iloc[1:,:]
cameraaccessory_lr_coef_df = cameraaccessory_lr_coef_df.loc[cameraaccessory_lr_coef_df['Coefficients']!=0.0]
cameraaccessory_lr_coef_df = cameraaccessory_lr_coef_df.sort_values(by=['Coefficients'], ascending = False)
cameraaccessory_lr_coef_df
Equation for Koyck Model:
Yt = α + µYt-1 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5 + ϵ -- (sale at time t is dependent on sale at time t-1)
If X1 is the advertising effect, β1 is the current effect of advertising, and the carry-over effect of advertising is β1µ/(1-µ).
Therefore the total effect of advertising = Current effect + Carry over effect
= β1 + β1 * µ/(1-µ)
= β1/(1-µ)
# Extract µ (the gmv_lag1 coefficient) and compute the total effect β/(1-µ)
mu = cameraaccessory_lr_coef_df.loc[cameraaccessory_lr_coef_df['Features'] == 'gmv_lag1', 'Coefficients'].values[0]
cameraaccessory_lr_coef_df['Total Effect'] = cameraaccessory_lr_coef_df['Coefficients']/(1 - mu)
cameraaccessory_lr_coef_df
cameraaccessory¶# Make the figure taller so all feature labels fit.
plt.figure(figsize=(10, 15), dpi=100, facecolor='w', edgecolor='k', frameon=True)
sns.barplot(y='Features', x='Total Effect', palette='husl', data=cameraaccessory_lr_coef_df, estimator=np.sum)
# Automatically adjust subplot params so that the subplots fit into the figure area.
plt.tight_layout()
# display the plot
plt.show()
| Features | Coefficients |
|---|---|
| product_vertical_lens | 0.299 |
| product_vertical_camerabag | 0.195 |
| product_vertical_camerahousing | 0.135 |
| product_vertical_camerabattery | 0.133 |
| Online marketing | 0.127 |
gamingaccessory¶from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
gamingaccessory_model = LinearRegression().fit(X_gamingaccessory_train, y_gamingaccessory_train)
y_gamingaccessory_test_pred = gamingaccessory_model.predict(X_gamingaccessory_test)
print('R2 Score: {}'.format(r2_score(y_gamingaccessory_test, y_gamingaccessory_test_pred)))
print('Mean Squared Error: {}'.format(mean_squared_error(y_gamingaccessory_test, y_gamingaccessory_test_pred)))
gamingaccessory using K-fold Cross Validation¶We will use cross_val_predict with 10-fold cross validation for our linear regression.
y_gamingaccessory = gamingaccessory_koy_df.pop('gmv')
X_gamingaccessory = gamingaccessory_koy_df
# Make cross validated predictions
from sklearn.model_selection import cross_val_score,cross_val_predict
from sklearn import metrics
gamingaccessory_model_cv = LinearRegression().fit(X_gamingaccessory, y_gamingaccessory)
gamingaccessory_predictions_cv = cross_val_predict(gamingaccessory_model_cv, X_gamingaccessory, y_gamingaccessory, cv=10)
accuracy = metrics.r2_score(y_gamingaccessory, gamingaccessory_predictions_cv)
print("Cross-Predicted R2 Score:", accuracy)
print('Mean Squared Error: {}'.format(mean_squared_error(y_gamingaccessory, gamingaccessory_predictions_cv)))
gamingaccessory coefficients from model without cv¶# linear regression model parameters
#Limiting floats output to 3 decimal points
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))
pd.set_option('display.precision',1)
gamingaccessory_lr_model_parameters = list(gamingaccessory_model.coef_)
gamingaccessory_lr_model_parameters.insert(0, gamingaccessory_model.intercept_)
gamingaccessory_lr_model_parameters = [round(x, 3) for x in gamingaccessory_lr_model_parameters]
cols = X_gamingaccessory_test.columns
cols = cols.insert(0, "constant")
gamingaccessory_lr_coef = list(zip(cols, gamingaccessory_lr_model_parameters))
gamingaccessory_lr_coef
gamingaccessory_lr_coef_df = pd.DataFrame(gamingaccessory_lr_coef)
col_rename = {0:'Features',1: 'Coefficients'}
gamingaccessory_lr_coef_df = gamingaccessory_lr_coef_df.rename(columns=col_rename)
gamingaccessory_lr_coef_df = gamingaccessory_lr_coef_df.iloc[1:,:]
gamingaccessory_lr_coef_df = gamingaccessory_lr_coef_df.loc[gamingaccessory_lr_coef_df['Coefficients']!=0.0]
gamingaccessory_lr_coef_df = gamingaccessory_lr_coef_df.sort_values(by=['Coefficients'], ascending = False)
gamingaccessory_lr_coef_df
Equation for Koyck Model:
Yt = α + µYt-1 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5 + ϵ -- (sale at time t is dependent on sale at time t-1)
If X1 is the advertising effect, β1 is the current effect of advertising, and the carry-over effect of advertising is β1µ/(1-µ).
Therefore the total effect of advertising = Current effect + Carry over effect
= β1 + β1 * µ/(1-µ)
= β1/(1-µ)
# Extract µ (the gmv_lag1 coefficient) and compute the total effect β/(1-µ)
mu = gamingaccessory_lr_coef_df.loc[gamingaccessory_lr_coef_df['Features'] == 'gmv_lag1', 'Coefficients'].values[0]
gamingaccessory_lr_coef_df['Total Effect'] = gamingaccessory_lr_coef_df['Coefficients']/(1 - mu)
gamingaccessory_lr_coef_df
gamingaccessory¶# Make the figure taller so all feature labels fit.
plt.figure(figsize=(10, 15), dpi=100, facecolor='w', edgecolor='k', frameon=True)
sns.barplot(y='Features', x='Total Effect', palette='husl', data=gamingaccessory_lr_coef_df, estimator=np.sum)
# Automatically adjust subplot params so that the subplots fit into the figure area.
plt.tight_layout()
# display the plot
plt.show()
| Features | Coefficients |
|---|---|
| product_vertical_gamingheadset | 0.199 |
| product_vertical_gamingheadset | 0.172 |
| is_mass_market | 0.163 |
| product_vertical_gamingmouse | 0.125 |
| product_vertical_gamingaccessorykit | 0.106 |
homeaudio¶from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
homeaudio_model = LinearRegression().fit(X_homeaudio_train, y_homeaudio_train)
y_homeaudio_test_pred = homeaudio_model.predict(X_homeaudio_test)
print('R2 Score: {}'.format(r2_score(y_homeaudio_test, y_homeaudio_test_pred)))
print('Mean Squared Error: {}'.format(mean_squared_error(y_homeaudio_test, y_homeaudio_test_pred)))
homeaudio using K-fold Cross Validation¶We will use cross_val_predict with 5-fold cross validation for our linear regression.
y_homeaudio = homeaudio_koy_df.pop('gmv')
X_homeaudio = homeaudio_koy_df
# Make cross validated predictions
from sklearn.model_selection import cross_val_score,cross_val_predict
from sklearn import metrics
homeaudio_model_cv = LinearRegression().fit(X_homeaudio, y_homeaudio)
homeaudio_predictions_cv = cross_val_predict(homeaudio_model_cv, X_homeaudio, y_homeaudio, cv=5)
accuracy = metrics.r2_score(y_homeaudio, homeaudio_predictions_cv)
print("Cross-Predicted R2 Score:", accuracy)
print('Mean Squared Error: {}'.format(mean_squared_error(y_homeaudio, homeaudio_predictions_cv)))
homeaudio coefficients from model without cv¶# linear regression model parameters
#Limiting floats output to 3 decimal points
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))
pd.set_option('display.precision',1)
homeaudio_lr_model_parameters = list(homeaudio_model.coef_)
homeaudio_lr_model_parameters.insert(0, homeaudio_model.intercept_)
homeaudio_lr_model_parameters = [round(x, 3) for x in homeaudio_lr_model_parameters]
cols = X_homeaudio_test.columns
cols = cols.insert(0, "constant")
homeaudio_lr_coef = list(zip(cols, homeaudio_lr_model_parameters))
homeaudio_lr_coef
homeaudio_lr_coef_df = pd.DataFrame(homeaudio_lr_coef)
col_rename = {0:'Features',1: 'Coefficients'}
homeaudio_lr_coef_df = homeaudio_lr_coef_df.rename(columns=col_rename)
homeaudio_lr_coef_df = homeaudio_lr_coef_df.iloc[1:,:]
homeaudio_lr_coef_df = homeaudio_lr_coef_df.loc[homeaudio_lr_coef_df['Coefficients']!=0.0]
homeaudio_lr_coef_df = homeaudio_lr_coef_df.sort_values(by=['Coefficients'], ascending = False)
homeaudio_lr_coef_df
Equation for Koyck Model:
Yt = α + µYt-1 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5 + ϵ -- (sale at time t is dependent on sale at time t-1)
If X1 is the advertising effect, β1 is the current effect of advertising, and the carry-over effect of advertising is β1µ/(1-µ).
Therefore the total effect of advertising = Current effect + Carry over effect
= β1 + β1 * µ/(1-µ)
= β1/(1-µ)
# Extract µ (the gmv_lag1 coefficient) and compute the total effect β/(1-µ)
mu = homeaudio_lr_coef_df.loc[homeaudio_lr_coef_df['Features'] == 'gmv_lag1', 'Coefficients'].values[0]
homeaudio_lr_coef_df['Total Effect'] = homeaudio_lr_coef_df['Coefficients']/(1 - mu)
homeaudio_lr_coef_df
homeaudio¶# Make the figure taller so all feature labels fit.
plt.figure(figsize=(10, 15), dpi=100, facecolor='w', edgecolor='k', frameon=True)
sns.barplot(y='Features', x='Total Effect', palette='husl', data=homeaudio_lr_coef_df, estimator=np.sum)
# Automatically adjust subplot params so that the subplots fit into the figure area.
plt.tight_layout()
# display the plot
plt.show()
| Features | Coefficients |
|---|---|
| product_vertical_homeaudiospeaker | 0.416 |
| is_mass_market | 0.210 |
| is_cod | 0.162 |
| NPS | 0.120 |
| Mean Temp | 0.107 |
The Additive and Multiplicative Linear Models that we have built till now can capture the current effect of the KPIs. However, if we want to capture the carry-over effect, we would want to model the current revenue figures based on the past figures of the KPIs.
In the distributed lag model, not only does the dependent variable enter in its lagged form, but the independent variables do as well. This is a more general model and captures the carry-over effect of all the variables. The Koyck model is thus a special case of the distributed lag model that includes the lag of only the dependent variable.
Yt = α + µ1Yt-1 + µ2Yt-2 + µ3Yt-3 + ....
+ β10X1t + β11X1t-1 + β12X1t-2 + ....
+ β20X2t + β21X2t-1 + β22X2t-2 + ....
+ β30X3t + β31X3t-1 + β32X3t-2 + ....
+ β40X4t + β41X4t-1 + β42X4t-2 + ....
+ β50X5t + β51X5t-1 + β52X5t-2 + ....
+ ϵ
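In code, this amounts to creating lag-1 to lag-3 copies of every column, the dependent gmv included. A simplified sketch of what the lag_variables helper does, on toy data, using pandas shift:

```python
import pandas as pd

# Toy frame standing in for a weekly sub-category dataframe.
df = pd.DataFrame({'gmv': [10, 20, 30, 40], 'TV': [1, 2, 3, 4]})

# Distributed lag design matrix: lags 1..3 of every column, target included.
for col in ['gmv', 'TV']:
    for n in (1, 2, 3):
        df[f'{col}_lag{n}'] = df[col].shift(n)

print(df.filter(like='_lag').columns.tolist())
```

The leading rows of each lagged column are NaN (there is no earlier week to shift in), which is why the notebook imputes them with 0 before modelling.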
homeaudio_org_df.head()
# Making copy of dataframes from the original ones
cameraaccessory_dladd_df = cameraaccessory_org_df.copy()
gamingaccessory_dladd_df = gamingaccessory_org_df.copy()
homeaudio_dladd_df = homeaudio_org_df.copy()
homeaudio_dladd_df.head()
# Checking for total count and percentage of null values in all columns of the dataframe.
total = pd.DataFrame(homeaudio_dladd_df.isnull().sum().sort_values(ascending=False), columns=['Total'])
percentage = pd.DataFrame(round(100*(homeaudio_dladd_df.isnull().sum()/homeaudio_dladd_df.shape[0]),2).sort_values(ascending=False)\
,columns=['Percentage'])
pd.concat([total, percentage], axis = 1).head()
Dropping the Week column as it is a row identifier and will not help in the prediction of revenue¶# removing columns
cameraaccessory_dladd_df = cameraaccessory_dladd_df.drop('Week', axis=1)
gamingaccessory_dladd_df = gamingaccessory_dladd_df.drop('Week', axis=1)
homeaudio_dladd_df = homeaudio_dladd_df.drop('Week', axis=1)
homeaudio_dladd_df.head()
cameraaccessory_dladd_df_columns = cameraaccessory_dladd_df.columns
gamingaccessory_dladd_df_columns = gamingaccessory_dladd_df.columns
homeaudio_dladd_df_columns = homeaudio_dladd_df.columns
cameraaccessory_dladd_df = lag_variables(cameraaccessory_dladd_df,cameraaccessory_dladd_df_columns,3)
gamingaccessory_dladd_df = lag_variables(gamingaccessory_dladd_df,gamingaccessory_dladd_df_columns,3)
homeaudio_dladd_df = lag_variables(homeaudio_dladd_df,homeaudio_dladd_df_columns,3)
homeaudio_dladd_df.head()
cameraaccessory_dladd_df = lag_variables(cameraaccessory_dladd_df,cameraaccessory_dladd_df_columns,2)
gamingaccessory_dladd_df = lag_variables(gamingaccessory_dladd_df,gamingaccessory_dladd_df_columns,2)
homeaudio_dladd_df = lag_variables(homeaudio_dladd_df,homeaudio_dladd_df_columns,2)
homeaudio_dladd_df.head()
cameraaccessory_dladd_df = lag_variables(cameraaccessory_dladd_df,cameraaccessory_dladd_df_columns,1)
gamingaccessory_dladd_df = lag_variables(gamingaccessory_dladd_df,gamingaccessory_dladd_df_columns,1)
homeaudio_dladd_df = lag_variables(homeaudio_dladd_df,homeaudio_dladd_df_columns,1)
homeaudio_dladd_df.head()
# Imputing all null values with 0
cameraaccessory_dladd_df.fillna(value=0, inplace=True)
gamingaccessory_dladd_df.fillna(value=0, inplace=True)
homeaudio_dladd_df.fillna(value=0, inplace=True)
homeaudio_dladd_df.head(10)
# Checking for total count and percentage of null values in all columns of the dataframe.
total = pd.DataFrame(homeaudio_dladd_df.isnull().sum().sort_values(ascending=False), columns=['Total'])
percentage = pd.DataFrame(round(100*(homeaudio_dladd_df.isnull().sum()/homeaudio_dladd_df.shape[0]),2).sort_values(ascending=False)\
,columns=['Percentage'])
pd.concat([total, percentage], axis = 1).head()
We will use Standard scaling.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
cameraaccessory_dladd_df[cameraaccessory_dladd_df.columns]=scaler.fit_transform(cameraaccessory_dladd_df[cameraaccessory_dladd_df.columns])
gamingaccessory_dladd_df[gamingaccessory_dladd_df.columns]=scaler.fit_transform(gamingaccessory_dladd_df[gamingaccessory_dladd_df.columns])
homeaudio_dladd_df[homeaudio_dladd_df.columns]=scaler.fit_transform(homeaudio_dladd_df[homeaudio_dladd_df.columns])
homeaudio_dladd_df.head()
As you know, the first basic step for regression is performing a train-test split.
from sklearn.model_selection import train_test_split
# We specify this so that the train and test data set always have the same rows, respectively
cameraaccessory_dladd_train, cameraaccessory_dladd_test = train_test_split(cameraaccessory_dladd_df, \
train_size = 0.7, test_size = 0.3, random_state = 100)
gamingaccessory_dladd_train, gamingaccessory_dladd_test = train_test_split(gamingaccessory_dladd_df, \
train_size = 0.7, test_size = 0.3, random_state = 100)
homeaudio_dladd_train, homeaudio_dladd_test = train_test_split(homeaudio_dladd_df, \
train_size = 0.7, test_size = 0.3, random_state = 100)
y_cameraaccessory_dladd_train = cameraaccessory_dladd_train.pop('gmv')
X_cameraaccessory_dladd_train = cameraaccessory_dladd_train
y_gamingaccessory_dladd_train = gamingaccessory_dladd_train.pop('gmv')
X_gamingaccessory_dladd_train = gamingaccessory_dladd_train
y_homeaudio_dladd_train = homeaudio_dladd_train.pop('gmv')
X_homeaudio_dladd_train = homeaudio_dladd_train
X_homeaudio_dladd_train.head()
y_cameraaccessory_dladd_test = cameraaccessory_dladd_test.pop('gmv')
X_cameraaccessory_dladd_test = cameraaccessory_dladd_test
y_gamingaccessory_dladd_test = gamingaccessory_dladd_test.pop('gmv')
X_gamingaccessory_dladd_test = gamingaccessory_dladd_test
y_homeaudio_dladd_test = homeaudio_dladd_test.pop('gmv')
X_homeaudio_dladd_test = homeaudio_dladd_test
X_homeaudio_dladd_test.head()
cameraaccessory¶from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
cameraaccessory_dladd_model = LinearRegression().fit(X_cameraaccessory_dladd_train, y_cameraaccessory_dladd_train)
y_cameraaccessory_dladd_test_pred = cameraaccessory_dladd_model.predict(X_cameraaccessory_dladd_test)
print('R2 Score: {}'.format(r2_score(y_cameraaccessory_dladd_test, y_cameraaccessory_dladd_test_pred)))
print('Mean Squared Error: {}'.format(mean_squared_error(y_cameraaccessory_dladd_test, y_cameraaccessory_dladd_test_pred)))
cameraaccessory using K-fold Cross Validation¶We will use cross_val_predict with 10-fold cross validation for our linear regression.
y_cameraaccessory_dladd = cameraaccessory_dladd_df.pop('gmv')
X_cameraaccessory_dladd = cameraaccessory_dladd_df
# Make cross validated predictions
from sklearn.model_selection import cross_val_score,cross_val_predict
from sklearn import metrics
cameraaccessory_dladd_model_cv = LinearRegression().fit(X_cameraaccessory_dladd, y_cameraaccessory_dladd)
cameraaccessory_dladd_predictions_cv = cross_val_predict(cameraaccessory_dladd_model_cv, X_cameraaccessory_dladd, \
y_cameraaccessory_dladd, cv=10)
accuracy = metrics.r2_score(y_cameraaccessory_dladd, cameraaccessory_dladd_predictions_cv)
print("Cross-Predicted R2 Score:", accuracy)
print('Mean Squared Error: {}'.format(mean_squared_error(y_cameraaccessory_dladd, cameraaccessory_dladd_predictions_cv)))
cameraaccessory coefficients from model with cv¶# linear regression model parameters
#Limiting floats output to 3 decimal points
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))
pd.set_option('display.precision',1)
cameraaccessory_lr_model_parameters = list(cameraaccessory_dladd_model_cv.coef_)
cameraaccessory_lr_model_parameters.insert(0, cameraaccessory_dladd_model_cv.intercept_)
cameraaccessory_lr_model_parameters = [round(x, 3) for x in cameraaccessory_lr_model_parameters]
cols = X_cameraaccessory_dladd_test.columns
cols = cols.insert(0, "constant")
cameraaccessory_lr_coef = list(zip(cols, cameraaccessory_lr_model_parameters))
cameraaccessory_lr_coef
cameraaccessory_lr_coef_df = pd.DataFrame(cameraaccessory_lr_coef)
col_rename = {0:'Features',1: 'Coefficients'}
cameraaccessory_lr_coef_df = cameraaccessory_lr_coef_df.rename(columns=col_rename)
cameraaccessory_lr_coef_df = cameraaccessory_lr_coef_df.iloc[1:,:]
cameraaccessory_lr_coef_df = cameraaccessory_lr_coef_df.loc[cameraaccessory_lr_coef_df['Coefficients']!=0.0]
cameraaccessory_lr_coef_df = cameraaccessory_lr_coef_df.sort_values(by=['Coefficients'], ascending = False)
cameraaccessory_lr_coef_df
cameraaccessory¶# Enlarge the figure vertically so that all feature bars fit.
plt.figure(figsize=(10, 35), dpi=100, facecolor='w', edgecolor='k', frameon=True)
sns.barplot(y='Features', x='Coefficients', palette='husl', data=cameraaccessory_lr_coef_df, estimator=np.sum)
# Automatically adjust subplot params so that the subplots fit into the figure area.
plt.tight_layout()
# display the plot
plt.show()
| Features | Coefficients |
|---|---|
| product_vertical_lens | 0.121 |
| product_vertical_filter | 0.117 |
| product_vertical_camerabag | 0.102 |
| product_vertical_cameraremotecontrol | 0.094 |
| is_mass_market | 0.079 |
gamingaccessory¶from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
gamingaccessory_dladd_model = LinearRegression().fit(X_gamingaccessory_dladd_train, y_gamingaccessory_dladd_train)
y_gamingaccessory_dladd_test_pred = gamingaccessory_dladd_model.predict(X_gamingaccessory_dladd_test)
print('R2 Score: {}'.format(r2_score(y_gamingaccessory_dladd_test, y_gamingaccessory_dladd_test_pred)))
print('Mean Squared Error: {}'.format(mean_squared_error(y_gamingaccessory_dladd_test, y_gamingaccessory_dladd_test_pred)))
gamingaccessory using K-fold Cross Validation¶We will use the cross_val_predict method with 10-fold cross-validation for our linear regression.
y_gamingaccessory_dladd = gamingaccessory_dladd_df.pop('gmv')
X_gamingaccessory_dladd = gamingaccessory_dladd_df
# Make cross validated predictions
from sklearn.model_selection import cross_val_score,cross_val_predict
from sklearn import metrics
gamingaccessory_dladd_model_cv = LinearRegression().fit(X_gamingaccessory_dladd, y_gamingaccessory_dladd)
gamingaccessory_dladd_predictions_cv = cross_val_predict(gamingaccessory_dladd_model_cv, X_gamingaccessory_dladd, \
y_gamingaccessory_dladd, cv=10)
accuracy = metrics.r2_score(y_gamingaccessory_dladd, gamingaccessory_dladd_predictions_cv)
print("Cross-Predicted Accuracy:", accuracy)
print('Mean Squared Error: {}'.format(mean_squared_error(y_gamingaccessory_dladd, gamingaccessory_dladd_predictions_cv)))
gamingaccessory with model with cv¶# linear regression model parameters
#Limiting floats output to 3 decimal points
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))
pd.set_option('display.precision',1)
gamingaccessory_lr_model_parameters = list(gamingaccessory_dladd_model_cv.coef_)
gamingaccessory_lr_model_parameters.insert(0, gamingaccessory_dladd_model_cv.intercept_)
gamingaccessory_lr_model_parameters = [round(x, 3) for x in gamingaccessory_lr_model_parameters]
cols = X_gamingaccessory_dladd_test.columns
cols = cols.insert(0, "constant")
gamingaccessory_lr_coef = list(zip(cols, gamingaccessory_lr_model_parameters))
gamingaccessory_lr_coef
gamingaccessory_lr_coef_df = pd.DataFrame(gamingaccessory_lr_coef)
col_rename = {0:'Features',1: 'Coefficients'}
gamingaccessory_lr_coef_df = gamingaccessory_lr_coef_df.rename(columns=col_rename)
gamingaccessory_lr_coef_df = gamingaccessory_lr_coef_df.iloc[1:,:]
gamingaccessory_lr_coef_df = gamingaccessory_lr_coef_df.loc[gamingaccessory_lr_coef_df['Coefficients']!=0.0]
gamingaccessory_lr_coef_df = gamingaccessory_lr_coef_df.sort_values(by=['Coefficients'], ascending = False)
gamingaccessory_lr_coef_df
gamingaccessory¶# Enlarge the figure vertically so that all feature bars fit.
plt.figure(figsize=(10, 35), dpi=100, facecolor='w', edgecolor='k', frameon=True)
sns.barplot(y='Features', x='Coefficients', palette='husl', data=gamingaccessory_lr_coef_df, estimator=np.sum)
# Automatically adjust subplot params so that the subplots fit into the figure area.
plt.tight_layout()
# display the plot
plt.show()
| Features | Coefficients |
|---|---|
| product_vertical_gamepad | 0.150 |
| product_vertical_gamingaccessorykit | 0.115 |
| is_mass_market | 0.109 |
| product_vertical_motioncontroller | 0.100 |
| product_vertical_gamingkeyboard | 0.089 |
homeaudio¶from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
homeaudio_dladd_model = LinearRegression().fit(X_homeaudio_dladd_train, y_homeaudio_dladd_train)
y_homeaudio_dladd_test_pred = homeaudio_dladd_model.predict(X_homeaudio_dladd_test)
print('R2 Score: {}'.format(r2_score(y_homeaudio_dladd_test, y_homeaudio_dladd_test_pred)))
print('Mean Squared Error: {}'.format(mean_squared_error(y_homeaudio_dladd_test, y_homeaudio_dladd_test_pred)))
homeaudio using K-fold Cross Validation¶We will use the cross_val_predict method with 10-fold cross-validation for our linear regression.
y_homeaudio_dladd = homeaudio_dladd_df.pop('gmv')
X_homeaudio_dladd = homeaudio_dladd_df
# Make cross validated predictions
from sklearn.model_selection import cross_val_score,cross_val_predict
from sklearn import metrics
homeaudio_dladd_model_cv = LinearRegression().fit(X_homeaudio_dladd, y_homeaudio_dladd)
homeaudio_dladd_predictions_cv = cross_val_predict(homeaudio_dladd_model_cv, X_homeaudio_dladd, y_homeaudio_dladd, cv=10)
accuracy = metrics.r2_score(y_homeaudio_dladd, homeaudio_dladd_predictions_cv)
print("Cross-Predicted Accuracy:", accuracy)
print('Mean Squared Error: {}'.format(mean_squared_error(y_homeaudio_dladd, homeaudio_dladd_predictions_cv)))
homeaudio with model with cv¶# linear regression model parameters
#Limiting floats output to 3 decimal points
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))
pd.set_option('display.precision',1)
homeaudio_lr_model_parameters = list(homeaudio_dladd_model_cv.coef_)
homeaudio_lr_model_parameters.insert(0, homeaudio_dladd_model_cv.intercept_)
homeaudio_lr_model_parameters = [round(x, 3) for x in homeaudio_lr_model_parameters]
cols = X_homeaudio_dladd_test.columns
cols = cols.insert(0, "constant")
homeaudio_lr_coef = list(zip(cols, homeaudio_lr_model_parameters))
homeaudio_lr_coef
homeaudio_lr_coef_df = pd.DataFrame(homeaudio_lr_coef)
col_rename = {0:'Features',1: 'Coefficients'}
homeaudio_lr_coef_df = homeaudio_lr_coef_df.rename(columns=col_rename)
homeaudio_lr_coef_df = homeaudio_lr_coef_df.iloc[1:,:]
homeaudio_lr_coef_df = homeaudio_lr_coef_df.loc[homeaudio_lr_coef_df['Coefficients']!=0.0]
homeaudio_lr_coef_df = homeaudio_lr_coef_df.sort_values(by=['Coefficients'], ascending = False)
homeaudio_lr_coef_df
homeaudio¶# Enlarge the figure vertically so that all feature bars fit.
plt.figure(figsize=(10, 35), dpi=100, facecolor='w', edgecolor='k', frameon=True)
sns.barplot(y='Features', x='Coefficients', palette='husl', data=homeaudio_lr_coef_df, estimator=np.sum)
# Automatically adjust subplot params so that the subplots fit into the figure area.
plt.tight_layout()
# display the plot
plt.show()
| Features | Coefficients |
|---|---|
| product_vertical_homeaudiospeaker | 0.184 |
| product_vertical_karaokeplayer | 0.173 |
| is_mass_market | 0.162 |
| is_cod | 0.145 |
| product_vertical_fmradio | 0.094 |
The Distributive Lag Model (Additive) helped us capture not only the current effect, but also the carry-over effect, of all the variables (dependent and independent).
Yt = α + µ1Yt-1 + µ2Yt-2 + µ3Yt-3 + ...
+ β1,0X1t + β1,1X1t-1 + β1,2X1t-2 + ...
+ β2,0X2t + β2,1X2t-1 + β2,2X2t-2 + ...
+ β3,0X3t + β3,1X3t-1 + β3,2X3t-2 + ...
+ β4,0X4t + β4,1X4t-1 + β4,2X4t-2 + ...
+ β5,0X5t + β5,1X5t-1 + β5,2X5t-2 + ...
+ ϵ
where βi,j is the coefficient of the j-th lag of the i-th regressor: each lag carries its own coefficient.
Now the Distributive Lag Model (Multiplicative) will help us capture the interactions between the current and carry-over effects of the KPIs.
To fit a multiplicative model, take logarithms of the data (on both sides of the model), then analyse the log data as before.
ln(Yt) = α + µ1ln(Yt-1) + µ2ln(Yt-2) + µ3ln(Yt-3) + ...
+ β1,0ln(X1t) + β1,1ln(X1t-1) + β1,2ln(X1t-2) + ...
+ β2,0ln(X2t) + β2,1ln(X2t-1) + β2,2ln(X2t-2) + ...
+ β3,0ln(X3t) + β3,1ln(X3t-1) + β3,2ln(X3t-2) + ...
+ β4,0ln(X4t) + β4,1ln(X4t-1) + β4,2ln(X4t-2) + ...
+ β5,0ln(X5t) + β5,1ln(X5t-1) + β5,2ln(X5t-2) + ...
+ ϵ'
homeaudio_org_df.head()
# Making copy of dataframes from the original ones
cameraaccessory_dlmul_df = cameraaccessory_org_df.copy()
gamingaccessory_dlmul_df = gamingaccessory_org_df.copy()
homeaudio_dlmul_df = homeaudio_org_df.copy()
homeaudio_dlmul_df.head()
# Checking for total count and percentage of null values in all columns of the dataframe.
total = pd.DataFrame(homeaudio_dlmul_df.isnull().sum().sort_values(ascending=False), columns=['Total'])
percentage = pd.DataFrame(round(100*(homeaudio_dlmul_df.isnull().sum()/homeaudio_dlmul_df.shape[0]),2).sort_values(ascending=False)\
,columns=['Percentage'])
pd.concat([total, percentage], axis = 1).head()
Dropping the Week column, as it is a row identifier and will not help in predicting revenue¶# removing columns
cameraaccessory_dlmul_df = cameraaccessory_dlmul_df.drop('Week', axis=1)
gamingaccessory_dlmul_df = gamingaccessory_dlmul_df.drop('Week', axis=1)
homeaudio_dlmul_df = homeaudio_dlmul_df.drop('Week', axis=1)
homeaudio_dlmul_df.head()
cameraaccessory_dlmul_df_columns = cameraaccessory_dlmul_df.columns
gamingaccessory_dlmul_df_columns = gamingaccessory_dlmul_df.columns
homeaudio_dlmul_df_columns = homeaudio_dlmul_df.columns
cameraaccessory_dlmul_df = lag_variables(cameraaccessory_dlmul_df,cameraaccessory_dlmul_df_columns,3)
gamingaccessory_dlmul_df = lag_variables(gamingaccessory_dlmul_df,gamingaccessory_dlmul_df_columns,3)
homeaudio_dlmul_df = lag_variables(homeaudio_dlmul_df,homeaudio_dlmul_df_columns,3)
homeaudio_dlmul_df.head()
cameraaccessory_dlmul_df = lag_variables(cameraaccessory_dlmul_df,cameraaccessory_dlmul_df_columns,2)
gamingaccessory_dlmul_df = lag_variables(gamingaccessory_dlmul_df,gamingaccessory_dlmul_df_columns,2)
homeaudio_dlmul_df = lag_variables(homeaudio_dlmul_df,homeaudio_dlmul_df_columns,2)
homeaudio_dlmul_df.head()
cameraaccessory_dlmul_df = lag_variables(cameraaccessory_dlmul_df,cameraaccessory_dlmul_df_columns,1)
gamingaccessory_dlmul_df = lag_variables(gamingaccessory_dlmul_df,gamingaccessory_dlmul_df_columns,1)
homeaudio_dlmul_df = lag_variables(homeaudio_dlmul_df,homeaudio_dlmul_df_columns,1)
homeaudio_dlmul_df.head()
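The lag_variables helper is defined earlier in the notebook and not shown in this section. A minimal sketch consistent with how it is called here (a dataframe, a list of columns, and a lag order) might look like the following; the exact implementation is an assumption:

```python
import pandas as pd

def lag_variables(df, columns, lag):
    # Shift the chosen columns down by `lag` rows and append them
    # as new features named '<column>_lag_<lag>' (an assumed naming scheme).
    lagged = df[columns].shift(lag)
    lagged.columns = ['{}_lag_{}'.format(col, lag) for col in columns]
    return pd.concat([df, lagged], axis=1)
```

Note that `shift` leaves NaNs in the first `lag` rows, which the next cell imputes with 0.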
# Imputing all null values with 0
cameraaccessory_dlmul_df.fillna(value=0, inplace=True)
gamingaccessory_dlmul_df.fillna(value=0, inplace=True)
homeaudio_dlmul_df.fillna(value=0, inplace=True)
homeaudio_dlmul_df.head(10)
# Checking for total count and percentage of null values in all columns of the dataframe.
total = pd.DataFrame(homeaudio_dlmul_df.isnull().sum().sort_values(ascending=False), columns=['Total'])
percentage = pd.DataFrame(round(100*(homeaudio_dlmul_df.isnull().sum()/homeaudio_dlmul_df.shape[0]),2).sort_values(ascending=False)\
,columns=['Percentage'])
pd.concat([total, percentage], axis = 1).head()
cameraaccessory_dlmul_df = cameraaccessory_dlmul_df.applymap(lambda x: np.log(x))
cameraaccessory_dlmul_df = cameraaccessory_dlmul_df.replace([np.inf, -np.inf], 0)
cameraaccessory_dlmul_df = cameraaccessory_dlmul_df.replace(np.nan, 0)
gamingaccessory_dlmul_df = gamingaccessory_dlmul_df.applymap(lambda x: np.log(x))
gamingaccessory_dlmul_df = gamingaccessory_dlmul_df.replace([np.inf, -np.inf], 0)
gamingaccessory_dlmul_df = gamingaccessory_dlmul_df.replace(np.nan, 0)
homeaudio_dlmul_df = homeaudio_dlmul_df.applymap(lambda x: np.log(x))
homeaudio_dlmul_df = homeaudio_dlmul_df.replace([np.inf, -np.inf], 0)
homeaudio_dlmul_df = homeaudio_dlmul_df.replace(np.nan, 0)
homeaudio_dlmul_df.head()
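Zeros in the data become -inf under np.log, which the cells above then replace with 0 by hand. As a side note (an alternative, not what the notebook does), np.log1p maps 0 to 0 directly and avoids the replacement step:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'gmv': [0.0, 10.0, 100.0]})

# np.log maps 0 to -inf, which then has to be patched afterwards
raw_log = np.log(df['gmv'])

# np.log1p computes log(1 + x), so 0 maps to 0 without any patching
safe_log = np.log1p(df['gmv'])
```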
# Checking for total count and percentage of null values in all columns of the dataframe.
total = pd.DataFrame(homeaudio_dlmul_df.isnull().sum().sort_values(ascending=False), columns=['Total'])
percentage = pd.DataFrame(round(100*(homeaudio_dlmul_df.isnull().sum()/homeaudio_dlmul_df.shape[0]),2).sort_values(ascending=False)\
,columns=['Percentage'])
pd.concat([total, percentage], axis = 1).head()
We will use Standard scaling.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
cameraaccessory_dlmul_df[cameraaccessory_dlmul_df.columns]=scaler.fit_transform(cameraaccessory_dlmul_df[cameraaccessory_dlmul_df.columns])
gamingaccessory_dlmul_df[gamingaccessory_dlmul_df.columns]=scaler.fit_transform(gamingaccessory_dlmul_df[gamingaccessory_dlmul_df.columns])
homeaudio_dlmul_df[homeaudio_dlmul_df.columns]=scaler.fit_transform(homeaudio_dlmul_df[homeaudio_dlmul_df.columns])
homeaudio_dlmul_df.head()
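One caveat worth noting: fitting the scaler on the full dataframe before splitting lets test-set statistics influence the training features. A leakage-free variant (a sketch on synthetic data, not the notebook's approach) splits first and fits the scaler on the training rows only:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one of the sub-category dataframes
df = pd.DataFrame(np.arange(20.0).reshape(10, 2), columns=['gmv', 'kpi'])
train, test = train_test_split(df, train_size=0.7, random_state=100)

scaler = StandardScaler().fit(train)   # statistics come from the train rows only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)   # the same statistics are reused for test
```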
The first basic step for regression modelling is performing a train-test split.
from sklearn.model_selection import train_test_split
# We specify this so that the train and test data set always have the same rows, respectively
cameraaccessory_dlmul_train, cameraaccessory_dlmul_test = train_test_split(cameraaccessory_dlmul_df, \
train_size = 0.7, test_size = 0.3, random_state = 100)
gamingaccessory_dlmul_train, gamingaccessory_dlmul_test = train_test_split(gamingaccessory_dlmul_df, \
train_size = 0.7, test_size = 0.3, random_state = 100)
homeaudio_dlmul_train, homeaudio_dlmul_test = train_test_split(homeaudio_dlmul_df, \
train_size = 0.7, test_size = 0.3, random_state = 100)
y_cameraaccessory_dlmul_train = cameraaccessory_dlmul_train.pop('gmv')
X_cameraaccessory_dlmul_train = cameraaccessory_dlmul_train
y_gamingaccessory_dlmul_train = gamingaccessory_dlmul_train.pop('gmv')
X_gamingaccessory_dlmul_train = gamingaccessory_dlmul_train
y_homeaudio_dlmul_train = homeaudio_dlmul_train.pop('gmv')
X_homeaudio_dlmul_train = homeaudio_dlmul_train
X_homeaudio_dlmul_train.head()
y_cameraaccessory_dlmul_test = cameraaccessory_dlmul_test.pop('gmv')
X_cameraaccessory_dlmul_test = cameraaccessory_dlmul_test
y_gamingaccessory_dlmul_test = gamingaccessory_dlmul_test.pop('gmv')
X_gamingaccessory_dlmul_test = gamingaccessory_dlmul_test
y_homeaudio_dlmul_test = homeaudio_dlmul_test.pop('gmv')
X_homeaudio_dlmul_test = homeaudio_dlmul_test
X_homeaudio_dlmul_test.head()
cameraaccessory¶from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
cameraaccessory_dlmul_model = LinearRegression().fit(X_cameraaccessory_dlmul_train, y_cameraaccessory_dlmul_train)
y_cameraaccessory_dlmul_test_pred = cameraaccessory_dlmul_model.predict(X_cameraaccessory_dlmul_test)
print('R2 Score: {}'.format(r2_score(y_cameraaccessory_dlmul_test, y_cameraaccessory_dlmul_test_pred)))
print('Mean Squared Error: {}'.format(mean_squared_error(y_cameraaccessory_dlmul_test, y_cameraaccessory_dlmul_test_pred)))
cameraaccessory using K-fold Cross Validation¶We will use the cross_val_predict method with 10-fold cross-validation for our linear regression.
y_cameraaccessory_dlmul = cameraaccessory_dlmul_df.pop('gmv')
X_cameraaccessory_dlmul = cameraaccessory_dlmul_df
# Make cross validated predictions
from sklearn.model_selection import cross_val_score,cross_val_predict
from sklearn import metrics
cameraaccessory_dlmul_model_cv = LinearRegression().fit(X_cameraaccessory_dlmul, y_cameraaccessory_dlmul)
cameraaccessory_dlmul_predictions_cv = cross_val_predict(cameraaccessory_dlmul_model_cv, X_cameraaccessory_dlmul, \
y_cameraaccessory_dlmul, cv=10)
accuracy = metrics.r2_score(y_cameraaccessory_dlmul, cameraaccessory_dlmul_predictions_cv)
print("Cross-Predicted Accuracy:", accuracy)
print('Mean Squared Error: {}'.format(mean_squared_error(y_cameraaccessory_dlmul, cameraaccessory_dlmul_predictions_cv)))
cameraaccessory with model with cv¶# linear regression model parameters
#Limiting floats output to 3 decimal points
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))
pd.set_option('display.precision',1)
cameraaccessory_lr_model_parameters = list(cameraaccessory_dlmul_model_cv.coef_)
cameraaccessory_lr_model_parameters.insert(0, cameraaccessory_dlmul_model_cv.intercept_)
cameraaccessory_lr_model_parameters = [round(x, 3) for x in cameraaccessory_lr_model_parameters]
cols = X_cameraaccessory_dlmul_test.columns
cols = cols.insert(0, "constant")
cameraaccessory_lr_coef = list(zip(cols, cameraaccessory_lr_model_parameters))
cameraaccessory_lr_coef
cameraaccessory_lr_coef_df = pd.DataFrame(cameraaccessory_lr_coef)
col_rename = {0:'Features',1: 'Coefficients'}
cameraaccessory_lr_coef_df = cameraaccessory_lr_coef_df.rename(columns=col_rename)
cameraaccessory_lr_coef_df = cameraaccessory_lr_coef_df.iloc[1:,:]
cameraaccessory_lr_coef_df = cameraaccessory_lr_coef_df.loc[cameraaccessory_lr_coef_df['Coefficients']!=0.0]
cameraaccessory_lr_coef_df = cameraaccessory_lr_coef_df.sort_values(by=['Coefficients'], ascending = False)
cameraaccessory_lr_coef_df
cameraaccessory¶# Enlarge the figure vertically so that all feature bars fit.
plt.figure(figsize=(10, 35), dpi=100, facecolor='w', edgecolor='k', frameon=True)
sns.barplot(y='Features', x='Coefficients', palette='husl', data=cameraaccessory_lr_coef_df, estimator=np.sum)
# Automatically adjust subplot params so that the subplots fit into the figure area.
plt.tight_layout()
# display the plot
plt.show()
| Features | Coefficients |
|---|---|
| is_mass_market | 0.061 |
| product_vertical_lens | 0.060 |
| product_vertical_cameraaccessory | 0.060 |
| product_vertical_camerabattery | 0.059 |
| product_vertical_cameratripod | 0.059 |
gamingaccessory¶from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
gamingaccessory_dlmul_model = LinearRegression().fit(X_gamingaccessory_dlmul_train, y_gamingaccessory_dlmul_train)
y_gamingaccessory_dlmul_test_pred = gamingaccessory_dlmul_model.predict(X_gamingaccessory_dlmul_test)
print('R2 Score: {}'.format(r2_score(y_gamingaccessory_dlmul_test, y_gamingaccessory_dlmul_test_pred)))
print('Mean Squared Error: {}'.format(mean_squared_error(y_gamingaccessory_dlmul_test, y_gamingaccessory_dlmul_test_pred)))
gamingaccessory using K-fold Cross Validation¶We will use the cross_val_predict method with 10-fold cross-validation for our linear regression.
y_gamingaccessory_dlmul = gamingaccessory_dlmul_df.pop('gmv')
X_gamingaccessory_dlmul = gamingaccessory_dlmul_df
# Make cross validated predictions
from sklearn.model_selection import cross_val_score,cross_val_predict
from sklearn import metrics
gamingaccessory_dlmul_model_cv = LinearRegression().fit(X_gamingaccessory_dlmul, y_gamingaccessory_dlmul)
gamingaccessory_dlmul_predictions_cv = cross_val_predict(gamingaccessory_dlmul_model_cv, X_gamingaccessory_dlmul, \
y_gamingaccessory_dlmul, cv=10)
accuracy = metrics.r2_score(y_gamingaccessory_dlmul, gamingaccessory_dlmul_predictions_cv)
print("Cross-Predicted Accuracy:", accuracy)
print('Mean Squared Error: {}'.format(mean_squared_error(y_gamingaccessory_dlmul, gamingaccessory_dlmul_predictions_cv)))
gamingaccessory with model with cv¶# linear regression model parameters
#Limiting floats output to 3 decimal points
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))
pd.set_option('display.precision',1)
gamingaccessory_lr_model_parameters = list(gamingaccessory_dlmul_model_cv.coef_)
gamingaccessory_lr_model_parameters.insert(0, gamingaccessory_dlmul_model_cv.intercept_)
gamingaccessory_lr_model_parameters = [round(x, 3) for x in gamingaccessory_lr_model_parameters]
cols = X_gamingaccessory_dlmul_test.columns
cols = cols.insert(0, "constant")
gamingaccessory_lr_coef = list(zip(cols, gamingaccessory_lr_model_parameters))
gamingaccessory_lr_coef
gamingaccessory_lr_coef_df = pd.DataFrame(gamingaccessory_lr_coef)
col_rename = {0:'Features',1: 'Coefficients'}
gamingaccessory_lr_coef_df = gamingaccessory_lr_coef_df.rename(columns=col_rename)
gamingaccessory_lr_coef_df = gamingaccessory_lr_coef_df.iloc[1:,:]
gamingaccessory_lr_coef_df = gamingaccessory_lr_coef_df.loc[gamingaccessory_lr_coef_df['Coefficients']!=0.0]
gamingaccessory_lr_coef_df = gamingaccessory_lr_coef_df.sort_values(by=['Coefficients'], ascending = False)
gamingaccessory_lr_coef_df
gamingaccessory¶# Enlarge the figure vertically so that all feature bars fit.
plt.figure(figsize=(10, 35), dpi=100, facecolor='w', edgecolor='k', frameon=True)
sns.barplot(y='Features', x='Coefficients', palette='husl', data=gamingaccessory_lr_coef_df, estimator=np.sum)
# Automatically adjust subplot params so that the subplots fit into the figure area.
plt.tight_layout()
# display the plot
plt.show()
| Features | Coefficients |
|---|---|
| product_vertical_gamepad | 0.088 |
| product_vertical_gamingmouse | 0.085 |
| is_mass_market | 0.082 |
| product_vertical_gamingkeyboard | 0.074 |
| is_cod | 0.072 |
homeaudio¶from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
homeaudio_dlmul_model = LinearRegression().fit(X_homeaudio_dlmul_train, y_homeaudio_dlmul_train)
y_homeaudio_dlmul_test_pred = homeaudio_dlmul_model.predict(X_homeaudio_dlmul_test)
print('R2 Score: {}'.format(r2_score(y_homeaudio_dlmul_test, y_homeaudio_dlmul_test_pred)))
print('Mean Squared Error: {}'.format(mean_squared_error(y_homeaudio_dlmul_test, y_homeaudio_dlmul_test_pred)))
Here R2 is negative, which signifies that the chosen model does not follow the trend of the data and therefore fits worse than a horizontal line; in other words, the chosen model (with its constraints) fits the data very poorly.
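A tiny illustration of how R2 goes negative: predicting the mean of the target for every point gives R2 = 0, and a model that reverses the trend scores below zero:

```python
from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0, 4.0]

# Predicting the mean (2.5) everywhere would give R^2 = 0;
# a prediction that reverses the trend does strictly worse, so R^2 < 0.
y_bad = [4.0, 3.0, 2.0, 1.0]
print(r2_score(y_true, y_bad))  # -3.0
```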
homeaudio using K-fold Cross Validation¶We will use the cross_val_predict method with 5-fold cross-validation for our linear regression.
y_homeaudio_dlmul = homeaudio_dlmul_df.pop('gmv')
X_homeaudio_dlmul = homeaudio_dlmul_df
# Make cross validated predictions
from sklearn.model_selection import cross_val_score,cross_val_predict
from sklearn import metrics
homeaudio_model_dlmul_cv = LinearRegression().fit(X_homeaudio_dlmul, y_homeaudio_dlmul)
homeaudio_dlmul_predictions_cv = cross_val_predict(homeaudio_model_dlmul_cv, X_homeaudio_dlmul, y_homeaudio_dlmul, cv=5)
accuracy = metrics.r2_score(y_homeaudio_dlmul, homeaudio_dlmul_predictions_cv)
print("Cross-Predicted Accuracy:", accuracy)
print('Mean Squared Error: {}'.format(mean_squared_error(y_homeaudio_dlmul, homeaudio_dlmul_predictions_cv)))
homeaudio with model with cv¶# linear regression model parameters
#Limiting floats output to 3 decimal points
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))
pd.set_option('display.precision',1)
homeaudio_lr_model_parameters = list(homeaudio_model_dlmul_cv.coef_)
homeaudio_lr_model_parameters.insert(0, homeaudio_model_dlmul_cv.intercept_)
homeaudio_lr_model_parameters = [round(x, 3) for x in homeaudio_lr_model_parameters]
cols = X_homeaudio_dlmul_test.columns
cols = cols.insert(0, "constant")
homeaudio_lr_coef = list(zip(cols, homeaudio_lr_model_parameters))
homeaudio_lr_coef
homeaudio_lr_coef_df = pd.DataFrame(homeaudio_lr_coef)
col_rename = {0:'Features',1: 'Coefficients'}
homeaudio_lr_coef_df = homeaudio_lr_coef_df.rename(columns=col_rename)
homeaudio_lr_coef_df = homeaudio_lr_coef_df.iloc[1:,:]
homeaudio_lr_coef_df = homeaudio_lr_coef_df.loc[homeaudio_lr_coef_df['Coefficients']!=0.0]
homeaudio_lr_coef_df = homeaudio_lr_coef_df.sort_values(by=['Coefficients'], ascending = False)
homeaudio_lr_coef_df
homeaudio¶# Enlarge the figure vertically so that all feature bars fit.
plt.figure(figsize=(10, 35), dpi=100, facecolor='w', edgecolor='k', frameon=True)
sns.barplot(y='Features', x='Coefficients', palette='husl', data=homeaudio_lr_coef_df, estimator=np.sum)
# Automatically adjust subplot params so that the subplots fit into the figure area.
plt.tight_layout()
# display the plot
plt.show()
| Features | Coefficients |
|---|---|
| product_vertical_homeaudiospeaker | 0.136 |
| is_mass_market | 0.133 |
| product_vertical_fmradio | 0.122 |
| is_cod | 0.112 |
| product_vertical_voicerecorder | 0.104 |
| Product Sub-category | Linear Regression Model | Cross Validation | R-square on Test Dataset | Mean Square Error |
|---|---|---|---|---|
| __cameraaccessory__ | Additive | No | 0.83 | 0.17 |
| | | Yes | -0.8 | 1.08 |
| | Multiplicative | No | 0.84 | 0.36 |
| | | Yes | 0.91 | 0.09 |
| | Koyck | No | 0.84 | 0.16 |
| | | Yes | 0.27 | 0.73 |
| | Distributive Lag Model (Additive) | No | 0.87 | 0.12 |
| | | Yes | 0.82 | 0.17 |
| | Distributive Lag Model (Multiplicative) | No | 0.77 | 0.50 |
| | | Yes | 0.82 | 0.18 |
| __gamingaccessory__ | Additive | No | 0.93 | 0.05 |
| | | Yes | 0.51 | 0.49 |
| | Multiplicative | No | 0.94 | 0.09 |
| | | Yes | 0.94 | 0.06 |
| | Koyck | No | 0.93 | 0.05 |
| | | Yes | 0.49 | 0.51 |
| | Distributive Lag Model (Additive) | No | 0.87 | 0.10 |
| | | Yes | 0.92 | 0.08 |
| | Distributive Lag Model (Multiplicative) | No | 0.93 | 0.11 |
| | | Yes | 0.89 | 0.11 |
| __homeaudio__ | Additive | No | 0.96 | 0.09 |
| | | Yes | 0.73 | 0.27 |
| | Multiplicative | No | -0.63 | 0.34 |
| | | Yes | 0.86 | 0.14 |
| | Koyck | No | 0.96 | 0.09 |
| | | Yes | 0.70 | 0.30 |
| | Distributive Lag Model (Additive) | No | 0.42 | 1.39 |
| | | Yes | 0.55 | 0.45 |
| | Distributive Lag Model (Multiplicative) | No | -0.23 | 0.26 |
| | | Yes | 0.57 | 0.43 |
We selected models based on the R2 score, the MSE, and the business relevance of the important attributes chosen by the model. We also preferred models with cross-validation: even though the ones without it sometimes give good scores, they are not very dependable or generalisable, owing to the limited dataset.¶Referring to the model dashboard above, we finalise the following models for the 3 mentioned product sub-categories - Camera Accessory, Gaming Accessory & Home Audio:
| Product Sub-category | Linear Regression Model | R-square on Test Dataset | Mean Square Error | Top 5 KPIs |
|---|---|---|---|---|
| __cameraaccessory__ | Multiplicative with CV | 0.91 | 0.09 | product_vertical_lens (__0.181__) |
| | | | | product_vertical_camerabattery (__0.160__) |
| | | | | is_mass_market (__0.149__) |
| | | | | product_vertical_camerabatterycharger (__0.121__) |
| | | | | TV (__0.105__) |
| __gamingaccessory__ | Multiplicative with CV | 0.94 | 0.06 | product_vertical_gamingheadset (__0.250__) |
| | | | | is_mass_market (__0.234__) |
| | | | | product_vertical_gamingmouse (__0.224__) |
| | | | | product_vertical_gamepad (__0.211__) |
| | | | | Online marketing_SMA_3 (__0.157__) |
| __homeaudio__ | Multiplicative with CV | 0.86 | 0.14 | product_vertical_homeaudiospeaker (__0.469__) |
| | | | | is_mass_market (__0.289__) |
| | | | | product_vertical_fmradio (__0.224__) |
| | | | | Radio_Ad_Stock (__0.147__) |
| | | | | Sponsorship (__0.121__) |
We notice that all 3 chosen models for the 3 sub-categories are multiplicative. This tells us that some interaction exists between the KPIs: these models relate the growth of revenue to the combined, multiplicative growth of the KPIs.
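Because the finalised models are log-log, each coefficient can be read as an elasticity. A small sketch with an illustrative coefficient (not taken from the tables above):

```python
# In a log-log (multiplicative) model, a coefficient beta is an elasticity:
# ln(Y) = alpha + beta * ln(X)  implies a 1% rise in X lifts Y by roughly beta %.
beta = 0.25          # illustrative coefficient, chosen for the example
x_growth = 1.01      # X grows by 1%
y_growth = x_growth ** beta  # implied multiplicative growth in Y
print(round((y_growth - 1) * 100, 3))  # 0.249 (% growth in Y)
```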
# Slightly alter the figure size to make it more horizontal.
plt.figure(figsize=(12, 4), dpi=100, facecolor='w', edgecolor='k', frameon=True)
sns.set_style("white") # white/whitegrid/dark/ticks
sns.set_context("paper") # talk/poster
# subplot 1
plt.subplot(1, 3, 1)
c = [i for i in range(1,51,1)]
plt.plot(c,y_cameraaccessory_mul, color="blue", linewidth=2.5, linestyle="-")
plt.plot(c,cameraaccessory_mul_predictions_cv, color="red", linewidth=2.5, linestyle="-")
plt.suptitle('Actual vs Predicted', fontsize=16, color = 'c') # Plot heading
plt.title('Camera Accessory', fontsize=12) # Plot heading
plt.xlabel('Index', fontsize=12) # X-label
plt.ylabel('gmv', fontsize=12) # Y-label
# subplot 2
plt.subplot(1, 3, 2)
c = [i for i in range(1,52,1)]
plt.plot(c,y_gamingaccessory_mul, color="blue", linewidth=2.5, linestyle="-")
plt.plot(c,gamingaccessory_mul_predictions_cv, color="red", linewidth=2.5, linestyle="-")
plt.title('Gaming Accessory', fontsize=12)
plt.xlabel('Index', fontsize=12) # X-label
plt.ylabel('gmv', fontsize=12) # Y-label
# subplot 3
plt.subplot(1, 3, 3)
c = [i for i in range(1,49,1)]
plt.plot(c,y_homeaudio_mul, color="blue", linewidth=2.5, linestyle="-")
plt.plot(c,homeaudio_mul_predictions_cv, color="red", linewidth=2.5, linestyle="-")
plt.title('Home Audio', fontsize=12)
plt.xlabel('Index', fontsize=12) # X-label
plt.ylabel('gmv', fontsize=12) # Y-label
# display the plot
plt.show()
# Slightly alter the figure size to make it more horizontal.
plt.figure(figsize=(12, 4), dpi=100, facecolor='w', edgecolor='k', frameon=True)
sns.set_style("white") # white/whitegrid/dark/ticks
sns.set_context("paper") # talk/poster
# subplot 1
plt.subplot(1, 3, 1)
# Plotting y_test and y_pred to understand the spread.
plt.scatter(y_cameraaccessory_mul,cameraaccessory_mul_predictions_cv)
plt.suptitle('y_actual vs y_predicted', fontsize=16, color = 'c') # Plot heading
plt.title('Camera Accessory', fontsize=12) # Plot heading
plt.xlabel('y_actual', fontsize=12) # X-label
plt.ylabel('y_pred', fontsize=12) # Y-label
# subplot 2
plt.subplot(1, 3, 2)
# Plotting y_test and y_pred to understand the spread.
plt.scatter(y_gamingaccessory_mul,gamingaccessory_mul_predictions_cv)
plt.title('Gaming Accessory', fontsize=12) # Plot heading
plt.xlabel('y_actual', fontsize=12) # X-label
plt.ylabel('y_pred', fontsize=12) # Y-label
# subplot 3
plt.subplot(1, 3, 3)
# Plotting y_test and y_pred to understand the spread.
plt.scatter(y_homeaudio_mul,homeaudio_mul_predictions_cv)
plt.title('Home Audio', fontsize=12) # Plot heading
plt.xlabel('y_actual', fontsize=12) # X-label
plt.ylabel('y_pred', fontsize=12) # Y-label
# display the plot
plt.show()
# Slightly alter the figure size to make it more horizontal.
plt.figure(figsize=(12, 4), dpi=100, facecolor='w', edgecolor='k', frameon=True)
sns.set_style("white") # white/whitegrid/dark/ticks
sns.set_context("paper") # talk/poster
# subplot 1
plt.subplot(1, 3, 1)
# Error terms
c = [i for i in range(1,51,1)]
plt.scatter(c,y_cameraaccessory_mul - cameraaccessory_mul_predictions_cv)
plt.suptitle('Error Terms', fontsize=16, color = 'c') # Plot heading
plt.title('Camera Accessory', fontsize=12) # Plot heading
plt.xlabel('Index', fontsize=12) # X-label
plt.ylabel('y_act - y_pred', fontsize=12) # Y-label
# subplot 2
plt.subplot(1, 3, 2)
# Error terms
c = [i for i in range(1,52,1)]
plt.scatter(c,y_gamingaccessory_mul - gamingaccessory_mul_predictions_cv)
plt.title('Gaming Accessory', fontsize=12) # Plot heading
plt.xlabel('Index', fontsize=12) # X-label
# subplot 3
plt.subplot(1, 3, 3)
# Error terms
c = [i for i in range(1,49,1)]
plt.scatter(c,y_homeaudio_mul - homeaudio_mul_predictions_cv)
plt.title('Home Audio', fontsize=12) # Plot heading
plt.xlabel('Index', fontsize=12) # X-label
# display the plot
plt.show()
# Slightly alter the figure size to make it more horizontal.
plt.figure(figsize=(12, 4), dpi=100, facecolor='w', edgecolor='k', frameon=True)
sns.set_style("white") # white/whitegrid/dark/ticks
sns.set_context("paper") # talk/poster
# subplot 1
plt.subplot(1, 3, 1)
# Plot the histogram of the error terms
sns.distplot((y_cameraaccessory_mul-cameraaccessory_mul_predictions_cv), bins = 20)
plt.suptitle('Error Terms', fontsize=16) # Plot heading
plt.title('Camera Accessory', fontsize=12) # Plot heading
plt.xlabel('Errors', fontsize = 12) # X-label
# subplot 2
plt.subplot(1, 3, 2)
# Plot the histogram of the error terms
sns.distplot((y_gamingaccessory_mul-gamingaccessory_mul_predictions_cv), bins = 20)
plt.title('Gaming Accessory', fontsize=12) # Plot heading
plt.xlabel('Errors', fontsize = 12) # X-label
# subplot 3
plt.subplot(1, 3, 3)
# Plot the histogram of the error terms
sns.distplot((y_homeaudio_mul-homeaudio_mul_predictions_cv), bins = 20)
plt.title('Home Audio', fontsize=12) # Plot heading
plt.xlabel('Errors', fontsize = 12) # X-label
# display the plot
plt.show()
# Slightly alter the figure size to make it more horizontal.
plt.figure(figsize=(12, 4), dpi=100, facecolor='w', edgecolor='k', frameon=True)
sns.set_style("white") # white/whitegrid/dark/ticks
sns.set_context("paper") # talk/poster
# subplot 1
plt.subplot(1, 3, 1)
# Plot the best-fit line between actual and predicted values
sns.regplot(y_cameraaccessory_mul, cameraaccessory_mul_predictions_cv)
plt.suptitle('Best Fitted Line', fontsize=16, color = 'c') # Plot heading
plt.title('Camera Accessory', fontsize=12) # Plot heading
plt.xlabel('y_actual', fontsize=12) # X-label
plt.ylabel('y_pred', fontsize=12) # Y-label
# subplot 2
plt.subplot(1, 3, 2)
# Plot actual vs predicted with a fitted regression line
sns.regplot(x=y_gamingaccessory_mul, y=gamingaccessory_mul_predictions_cv)
plt.title('Gaming Accessory', fontsize=12) # Plot heading
plt.xlabel('y_actual', fontsize=12) # X-label
plt.ylabel('y_pred', fontsize=12) # Y-label
# subplot 3
plt.subplot(1, 3, 3)
# Plot actual vs predicted with a fitted regression line
sns.regplot(x=y_homeaudio_mul, y=homeaudio_mul_predictions_cv)
plt.title('Home Audio', fontsize=12) # Plot heading
plt.xlabel('y_actual', fontsize=12) # X-label
plt.ylabel('y_pred', fontsize=12) # Y-label
# display the plot
plt.show()
Camera Accessory: Revenue = 0.0 + (0.181 × product_vertical_lens) + (0.160 × product_vertical_camerabattery) + (0.149 × is_mass_market) + (0.121 × product_vertical_camerabatterycharger) + (0.105 × TV)
Gaming Accessory: Revenue = 0.0 + (0.250 × product_vertical_gamingheadset) + (0.234 × is_mass_market) + (0.224 × product_vertical_gamingmouse) + (0.211 × product_vertical_gamepad) + (0.157 × Online marketing_SMA_3)
Home Audio: Revenue = 0.0 + (0.469 × product_vertical_homeaudiospeaker) + (0.289 × is_mass_market) + (0.224 × product_vertical_fmradio) + (0.147 × Radio_Ad_Stock) + (0.121 × Sponsorship)
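Since each equation is linear in the (scaled) KPIs, predicted revenue is simply a dot product of coefficients and feature values. A minimal sketch using the Home Audio coefficients above, with hypothetical standardized inputs (the input values are illustrative, not taken from the data):

```python
import numpy as np

# Home Audio model coefficients, in order:
# homeaudiospeaker, is_mass_market, fmradio, Radio_Ad_Stock, Sponsorship
coefs = np.array([0.469, 0.289, 0.224, 0.147, 0.121])
intercept = 0.0

# Hypothetical standardized KPI values for one week
x = np.array([1.0, 1.0, 0.0, 0.5, 0.0])

revenue = intercept + coefs @ x
print(round(revenue, 4))  # -> 0.8315
```

Because the features are standardized, each coefficient is directly comparable: a one-unit (one-standard-deviation) increase in Radio_Ad_Stock adds 0.147 units to predicted revenue, holding the other KPIs constant.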
Revenue will grow with a unit increase in any of these independent KPIs, all other KPIs held constant.¶# Keep only the top-5 and bottom-5 coefficients for each sub-category
# (DataFrame.append is deprecated; use pd.concat instead)
cameraaccessory_mul_lr_coef_df = pd.concat([cameraaccessory_mul_lr_coef_df.head(), cameraaccessory_mul_lr_coef_df.tail()])
gamingaccessory_mul_lr_coef_df = pd.concat([gamingaccessory_mul_lr_coef_df.head(), gamingaccessory_mul_lr_coef_df.tail()])
homeaudio_mul_lr_coef_df = pd.concat([homeaudio_mul_lr_coef_df.head(), homeaudio_mul_lr_coef_df.tail()])
gamingaccessory_mul_lr_coef_df
# Slightly alter the figure size to make it more horizontal.
plt.figure(figsize=(15,6), dpi=100, facecolor='w', edgecolor='k', frameon=True)
# subplot 1
plt.subplot(1, 3, 1)
sns.barplot(x='Features', y='Coefficients', color = 'tab:red',alpha=0.8, data=cameraaccessory_mul_lr_coef_df, estimator=np.sum)
plt.title('Camera Accessories', fontsize=12, alpha=0.8)
plt.xticks(rotation='vertical', fontsize=10)
# subplot 2
plt.subplot(1, 3, 2)
sns.barplot(x='Features', y='Coefficients', color = 'tab:red',alpha=0.8, data=gamingaccessory_mul_lr_coef_df, estimator=np.sum)
plt.title('Gaming Accessories', fontsize=12, alpha=0.8)
plt.xticks(rotation='vertical', fontsize=10)
# subplot 3
plt.subplot(1, 3, 3)
sns.barplot(x='Features', y='Coefficients', color = 'tab:red',alpha=0.8, data=homeaudio_mul_lr_coef_df, estimator=np.sum)
plt.title('Home Audio', fontsize=12, alpha=0.8)
plt.suptitle('Top 5 features that affect each of the 3 product sub-categories (both positively and adversely) \
as per our chosen models', fontsize=15, color = 'r', alpha=1)
plt.xticks(rotation='vertical', fontsize=10)
# Automatically adjust subplot params so that the subplots fit into the figure area.
#plt.tight_layout()
# display the plot
plt.show()
Camera Accessory:
- Lens, Camera Batteries & Camera Battery Chargers fetch the highest revenue.
- TV has a positive impact on revenue: one unit of TV spend can boost revenue by 0.105 units. Content Marketing spend, on the other hand, has a negative impact.
- Mass-market products contribute more to revenue than luxury products.
- Discounts for this sub-category generally work adversely, bringing revenue down.

Gaming Accessory:
- Gaming Headsets, Gaming Mice & Gamepads fetch the highest revenue; Gaming Memory Cards, on the contrary, result in a loss.
- Online Marketing, Radio & Others have a positive cumulative impact on revenue, whereas Sponsorship spend has a negative cumulative effect.
- Mass-market products contribute more to revenue than luxury products.
- Discounts for this sub-category generally work adversely, bringing revenue down.

Home Audio:
- Home Audio Speakers & FM Radios fetch the highest revenue.
- Mass-market products contribute more to revenue than luxury products.
- Radio Adstock (the carry-over effect of radio advertising) spend boosts revenue to a significant extent.
- Sponsorship has a positive impact on revenue; Content Marketing spend, on the other hand, has a negative impact.
- COD payments for this sub-category generally bring revenue down.
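One simple way to turn such coefficients into a first-cut budget split is to allocate a lever budget in proportion to the positive marketing-lever coefficients. The sketch below does this for Home Audio using Radio_Ad_Stock (0.147) and Sponsorship (0.121) from the chosen model; it is an illustrative heuristic under the assumption of a linear, saturation-free response, not a definitive optimization.

```python
# Positive marketing-lever coefficients for Home Audio (from the chosen model)
coefs = {'Radio_Ad_Stock': 0.147, 'Sponsorship': 0.121}

# Split the budget in proportion to each lever's coefficient
total = sum(coefs.values())
allocation = {lever: round(100 * c / total, 1) for lever, c in coefs.items()}
print(allocation)  # -> {'Radio_Ad_Stock': 54.9, 'Sponsorship': 45.1}
```

In practice the split should also account for diminishing returns (the adstock/saturation curves), so the proportional rule is only a starting point for the next year's plan.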